kernel crash on Intel(R) Xeon(TM) CPU 2.66GHz 4-cpus machine
jec at rptec.ch
Mon Aug 21 12:33:47 UTC 2006
> After leaving the new 4 simulations running, I was out for the
> weekend and came back this morning to find the PC completely dead
> (power was on, but the monitor was black and no input device got any
> response). The first time I ran the simulations, all data were written
> to the external (USB) disk. Fearing this could have been the cause of
> the first failure, I ran these new simulations entirely on /dev/hda. But
> it crashed again.
>> Test the memory. Bad RAM is the main cause of machine freezes I've seen.
> I've just tested it, with memtest86 (ver. 1.65). It passed the test
> completely with no errors.
OK, one thing left to investigate. Just to be sure, how long did you
run the test? A single pass takes about 2h.
>> > Also, just to be sure, what is the machine? A Dell? Do you have a SCSI
>> > RAID card? PERC3?
> The machine is an HP (workstation xw6000). I believe it has a PCI SCSI
> card (it appears in the BIOS setup under the PCI devices entry, occupying
> two IRQs), and cat /proc/scsi/scsi returns
OK, the problem I had (known and acknowledged by Dell) only affects PERC3
adapters, so it doesn't apply to your case.
>> > And since his machine is 3 years old, I think it's not a 64-bit
>> > capable one...
> yes, I would exclude 64-bit, it's a Xeon DP 2658 MHz; in fact I've just
> discovered (looking at the system config in the BIOS setup) that I have
> *two* processors (this is not my machine, I only started working on
> it recently), which are of course not dual core... therefore I really do
> not understand why Linux recognized 4 processors!!
It's a HyperThreading processor. Each physical CPU is *seen* as 2
processors, so your 2 CPUs show up as 4, but in fact it's just useless in
the majority of situations. We ended up disabling it in the BIOS for ours...
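
To check how many of the processors Linux reports are real physical CPUs,
you can count the entries in /proc/cpuinfo. Below is a minimal Python
sketch, assuming the kernel exposes a "physical id" field there (x86
kernels with HyperThreading support usually do). The function name and the
sample text are made up for illustration; the sample is not output from
your machine.

```python
def count_cpus(cpuinfo_text):
    """Return (logical, physical) CPU counts parsed from /proc/cpuinfo text."""
    logical = 0
    physical_ids = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1  # one "processor" entry per logical CPU
        elif key == "physical id":
            physical_ids.add(value)  # same id means same physical package
    return logical, len(physical_ids)


# Made-up excerpt for two HyperThreading Xeons (assumption, not real output).
SAMPLE = """\
processor   : 0
physical id : 0
siblings    : 2

processor   : 1
physical id : 0
siblings    : 2

processor   : 2
physical id : 3
siblings    : 2

processor   : 3
physical id : 3
siblings    : 2
"""

logical, physical = count_cpus(SAMPLE)
print(f"{logical} logical processors on {physical} physical CPUs")
```

On the real machine you would pass open("/proc/cpuinfo").read() instead of
SAMPLE. With HyperThreading disabled in the BIOS, both counts should match.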
> today I still can't find any hint in the many log files; I'll just report
> a message which is often repeated (at random times):
> kernel: [17203596.832000] Inbound IN=eth0 OUT= MAC=xxx SRC=220.127.116.11
> DST=xxx LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=1643 DF PROTO=TCP SPT=2490
> DPT=10000 WINDOW=65535 RES=0x00 SYN URGP=0
These are probably iptables firewall log messages for dropped packets. Harmless.
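
If you want to read such lines more easily, here is a small Python sketch
that splits the message quoted above into its KEY=VALUE fields and bare
flags. It assumes "Inbound" is a log prefix set in the firewall rules; the
function name is only for illustration.

```python
def parse_netfilter_log(line):
    """Split an iptables LOG line into KEY=VALUE fields and bare flags."""
    fields = {}
    flags = []
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value  # e.g. "OUT=" yields an empty value
        elif token.isalpha() and token.isupper():
            flags.append(token)  # bare upper-case words such as DF, SYN
    return fields, flags


# The message from the log, joined back into a single line.
LINE = (
    "kernel: [17203596.832000] Inbound IN=eth0 OUT= MAC=xxx SRC=220.127.116.11 "
    "DST=xxx LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=1643 DF PROTO=TCP SPT=2490 "
    "DPT=10000 WINDOW=65535 RES=0x00 SYN URGP=0"
)

fields, flags = parse_netfilter_log(LINE)
print(fields["PROTO"], "from", fields["SRC"], "to port", fields["DPT"], flags)
```

Read this way, the line is a TCP SYN (a connection attempt) arriving on
eth0 for port 10000, with an empty OUT= because the packet was addressed to
the machine itself.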
> I attach to this message my current "dmesg" output (the one containing
> the buffer since this morning's boot); maybe you can find some
> clue, for example why it recognizes FOUR processors
There seems to be nothing special.
>> > You could also send the syslog to a remote machine, so if there is a
>> > disk problem, you have a better chance of getting the log message, if any.
>> > Ben, would the kdump kernel be useful in this situation?
>> The kdump would help, but I haven't set up any useful tools for it yet.
> what does that mean? Is it unusable, or can I try it?
> meanwhile I will try to use the machine without executing simulations
Things to try now:
- Disable HyperThreading.
- Disable one CPU, so you end up with only one CPU.
- Run your simulation again.
- Remove all devices you don't need. Keep only one disk, for example.
- Run the simulation again.