kernel crash on Intel(R) Xeon(TM) CPU 2.66GHz 4-cpus machine

Jean-Eric Cuendet jec at rptec.ch
Mon Aug 21 12:33:47 UTC 2006


> After leaving the 4 new simulations running, I was out for the 
> weekend and came back this morning to find the PC completely dead 
> (power was on, but the monitor was black and there was no response to 
> any input device). The first time I ran the simulations, all data were 
> written to the external (USB) disk. Fearing this could have been the 
> cause of the first failure, I ran these new simulations entirely on 
> /dev/hda. But it crashed again.

Sad...

>> Test memory. That's the main cause I've seen of machines freezing.
> 
> I've just tested it, with memtest86 (ver. 1.65). It passed the test 
> completely with no errors.

OK, one thing left to investigate. Just to be sure, how long did you 
run the test? A single pass takes about 2 hours.

>> > Also, just to be sure, what is the machine? Dell? Have you a SCSI RAID
>> > card? PERC3?
> 
> The machine is an HP (workstation xw6000). I suppose it has a PCI SCSI 
> card (it appears in the BIOS setup, under the PCI devices entry, 
> occupying two IRQs), and cat /proc/scsi/scsi returns

OK, the problem I had (known and acknowledged by Dell) only affects 
PERC3 adapters, so it doesn't apply in your case.

>> > And since his machine is 3 years old, I think it's not a 64-bit 
>> > capable one...
> 
> yes, I would exclude 64-bit, it's a Xeon DP 2658 MHz; in fact I've just 
> discovered (looking at the system config in the BIOS setup) that I have 
> *two* processors (this is not my machine, I only started working on 
> it recently), which are of course not dual core... therefore I really do 
> not understand why Linux recognized 4 processors!!

These are HyperThreading processors. Each physical CPU is *seen* as 2 
logical processors, so your 2 CPUs show up as 4. In practice it's 
useless in the majority of situations. We ended up disabling it in the 
BIOS on ours...
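
If you want to check what the kernel actually sees, here is a rough Python sketch that compares logical processors with physical packages. It assumes an x86-style /proc/cpuinfo with "processor" and "physical id" lines. The function name and the sample text are made up for illustration; the sample is not output from your machine.

```python
def summarize_cpuinfo(text):
    """Count logical CPUs and physical packages in /proc/cpuinfo-style text."""
    logical = 0
    packages = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1          # one entry per logical CPU
        elif key == "physical id":
            packages.add(value)   # distinct ids = distinct physical packages
    return logical, len(packages)


# Made-up sample: 2 physical packages, each exposing 2 logical CPUs.
SAMPLE = """\
processor\t: 0
physical id\t: 0
siblings\t: 2

processor\t: 1
physical id\t: 0
siblings\t: 2

processor\t: 2
physical id\t: 3
siblings\t: 2

processor\t: 3
physical id\t: 3
siblings\t: 2
"""

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the made-up sample off Linux
    logical, packages = summarize_cpuinfo(text)
    print(f"{logical} logical CPUs in {packages} physical package(s)")
```

With HyperThreading enabled you would expect more logical CPUs than physical packages. After disabling it in the BIOS, the two numbers should match on a machine without multi-core CPUs.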

> today I still can't find any hint in the many log files; I'll just report 
> a message which is often repeated (at random times):
> 
> kernel: [17203596.832000] Inbound IN=eth0 OUT= MAC=xxx SRC=38.119.250.76 
> DST=xxx LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=1643 DF PROTO=TCP SPT=2490 
> DPT=10000 WINDOW=65535 RES=0x00 SYN URGP=0

These are probably iptables firewall log messages. Harmless: just 
packets from the network being dropped.
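
In case it helps when reading these lines, here is a small Python sketch that splits one into its KEY=VALUE fields and bare flags. I'm assuming "Inbound" is just the log prefix configured in the firewall rule. The function name is hypothetical, and the line is the one you quoted, joined onto one line without the timestamp.

```python
def parse_netfilter_log(line):
    """Split a firewall log line into KEY=VALUE fields and bare flags."""
    fields = {}
    flags = []
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value      # e.g. SRC, DST, PROTO, DPT
        elif token.isupper():
            flags.append(token)      # e.g. DF, SYN; skips the "Inbound" prefix
    return fields, flags


LINE = ("Inbound IN=eth0 OUT= MAC=xxx SRC=38.119.250.76 DST=xxx LEN=48 "
        "TOS=0x00 PREC=0x00 TTL=113 ID=1643 DF PROTO=TCP SPT=2490 "
        "DPT=10000 WINDOW=65535 RES=0x00 SYN URGP=0")

fields, flags = parse_netfilter_log(LINE)
print(fields["SRC"], "-> port", fields["DPT"], fields["PROTO"], flags)
```

Read that way, the message is a TCP connection attempt (SYN) from 38.119.250.76 to port 10000 arriving on eth0, which has nothing to do with the freeze.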

> I attach to this message my current "dmesg" output (the one containing the 
> buffer since the last boot this morning), maybe you can discover some 
> clue, for example why it recognizes FOUR processors

There seems to be nothing special.

>> > You could also send the syslog to a remote machine, so if there is a 
>> > disk problem, you have a better chance of getting the log message, if any. 
>> > Ben, would the kdump kernel be useful in this situation?
>>
>> The kdump would help, but I haven't set up any useful tools for it yet.
> 
> what does that mean? Is it unusable, or can I try it?

Ben?

> meanwhile I will try to use the machine without executing simulations

Things to try now:
- Disable HyperThreading.
- Disable one CPU, so you end up with only one CPU.
- Try your simulation again.

Then:
- Remove all devices you don't need. Keep only one disk, for example.
- Run the simulation again.

-jec




