[Bug 651370] Re: ec2 kernel crash invalid opcode 0000 [#1]
Brandon Black
651370 at bugs.launchpad.net
Tue Oct 26 07:55:10 UTC 2010
I looked at the crash in more detail this evening, because it's causing me a lot of headaches now. The last time I tried to boot a new c1.xlarge in us-east-1, I had to go through the crash/terminate/relaunch cycle 7 times before I got a working instance. I don't have a patch or an answer yet, but I have a lot of hints:
1) c1.xlarge seems to be going through some changes of underlying
CPU/hardware, which could explain the randomness. It probably depends
on which hardware you land on. The older hosts are Xeon E5410 and the
newer ones are Xeon E5506. So far, every time I've gotten a launch that
didn't crash and thought to check, it has been on an E5410.
2) The exact instruction throwing invalid opcode is MONITOR (0f 01 c8).
The instructions MONITOR and MWAIT are used for efficient idling on
newer CPUs, which I guess is the whole point of the intel_idle code
we're crashing in.
3) These are not the sorts of instructions that can be executed in a VM
environment like Xen without special support. Googling reveals
discussions/patches to Xen for supporting these instructions in various
ways (either as a hypercall encapsulating the whole monitor/wait pair,
or masking the capability in CPUID so that Linux doesn't detect support
and doesn't try to use it at all). Various related links:
http://lists.xensource.com/archives/html/xen-devel/2010-04/msg00043.html
http://markmail.org/thread/terab63w744x3m2r
http://www.sfr-fresh.com/unix/misc/xen-4.0.1.tar.gz:a/xen-4.0.1/docs/misc/cpuid-config-for-guest.txt
4) intel_idle can be effectively disabled from the kernel commandline
with intel_idle.max_cstate=0
( http://kerneltrap.org/mailarchive/git-commits-head/2010/5/28/40718 ),
which will fall back on acpi_idle behavior. If it still crashes,
there's also a commandline flag "idle=nomwait" which might prevent
acpi_idle from using mwait as well.
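
For anyone who wants to compare instances, here is a rough Python sketch
of the check described in hints 1 and 2: it reads /proc/cpuinfo-style text
and reports the CPU model plus whether the "monitor" flag (MONITOR/MWAIT)
is advertised to the guest. The function name summarize_cpuinfo is made up
for illustration, and it only looks at the first processor entry. It can
obviously only be run on an instance that actually booted.

```python
import os


def summarize_cpuinfo(text):
    """Return (model_name, has_monitor_flag) for the first CPU listed
    in /proc/cpuinfo-style text. Hypothetical helper, not an existing tool."""
    model = None
    flags = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    # "monitor" is the /proc/cpuinfo flag name for MONITOR/MWAIT support
    return model, "monitor" in flags


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        model, has_monitor = summarize_cpuinfo(f.read())
    print("CPU model:", model)
    print("MONITOR/MWAIT advertised:", has_monitor)
```

If the flag shows up on the E5506 hosts but not on the E5410 ones, that
would point toward the CPUID masking theory, but that is a guess until
someone collects the data.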
I don't know at this point where the true bug lies. It could be that
the intel_idle code needs to make an exception to its detection routines
under Xen. It could be that some of Amazon's Xen hosts are configured
differently (wrt CPUID masking for mwait) than others. It could be any
of a number of related things. However, I suspect new AMIs for Maverick
on EC2 that disable mwait from the commandline in grub.conf/menu.lst per
above might fix this. I'll try making my own AMIs with this change in
the morning and see how it goes.
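
A minimal sketch of the menu.lst change, assuming a grub-legacy file whose
boot entries use plain "kernel ..." lines. The helper name
add_kernel_params and the sample paths in the usage note are hypothetical,
and on a real image the file may be regenerated by the distro's grub
tooling, so treat this as an illustration of the edit rather than a tested
AMI build step.

```python
def add_kernel_params(menu_lst_text, params=("intel_idle.max_cstate=0",)):
    """Append each parameter in `params` to every active 'kernel' line of
    grub-legacy menu.lst text, skipping parameters already present.
    Commented-out lines are left untouched."""
    result = []
    for line in menu_lst_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "kernel":
            missing = [p for p in params if p not in fields[1:]]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        result.append(line)
    trailing_newline = "\n" if menu_lst_text.endswith("\n") else ""
    return "\n".join(result) + trailing_newline
```

For example, a line like "kernel /boot/vmlinuz root=/dev/sda1 ro" (made-up
paths) would come back with " intel_idle.max_cstate=0" appended. Passing
params=("intel_idle.max_cstate=0", "idle=nomwait") would add both flags
mentioned in hint 4.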
--
ec2 kernel crash invalid opcode 0000 [#1]
https://bugs.launchpad.net/bugs/651370