Crash diagnostics help

Hal Burgiss hal at dbsinteractive.com
Thu Feb 4 14:38:43 UTC 2010


Hello,

I am reporting this here since I am not really sure which package to file this
against. Presumably this is a kernel problem, but it has some curious aspects
to it, and I thought I might try to get advice about where to go next.

Quick Summary: server crash
Current Hardware: Dell PowerEdge 2650
Platform: 8.04 (updated)
Kernel: Linux www 2.6.24-26-server #1 SMP Tue Dec 1 19:19:20 UTC 2009 i686
GNU/Linux
Profile: Web server with vhosted clients, and basic LAMP functionality.
Typical load: less than .20, rarely above .50

Symptom summary: System fails to fully respond. System is running, and
answers pings quite normally, but ALL servers fail to respond (apache, sshd,
etc.), requiring a reboot to restore "normal" functionality.

Related log data: None.


I've run into a troubling situation that has followed me from one hardware
profile to something radically different, with the same nasty results. As
mentioned above this system supports several client web sites. Its main
purpose is Apache/php. Mysql is running on a separate system. ftp is installed
but firewalled and really not used. Mail is only there to relay out mail from
the vhosted web clients. No incoming mail.

What is most troubling is that 2 months ago we moved everything from a
completely different 8.04 system (an IBM x330 server) because of the same
problem, i.e. the system dies mysteriously with no log data, pings normally, nmap
shows all services running, but none of those services respond fully. I had
assumed we had some obscure hardware related problem, and moved all the
clients over to the current system. But something else is going on since the
problem has followed me to the current system. 

The best I can get from the logs is that the last Apache request was served at
16:40. Syslogd left its ---MARK--- thing in syslog for the last time at
16:56, which is the last entry that I can find in any log, until a reboot at
17:33. Absolutely nothing unusual in syslog, kern.log, or any other log,
during any of this timeframe. Nothing real unusual in any Apache log either.
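
One way to pin down a silent window like this is to scan a log for the widest
gap between consecutive timestamps. The following is only a rough sketch: the
function name largest_gap is hypothetical, it assumes the traditional syslog
prefix ("Mon DD HH:MM:SS" in the first 15 characters), and since that format
carries no year, one has to be supplied by hand.

```python
from datetime import datetime


def largest_gap(lines, year=2010):
    """Return (seconds, before, after) for the widest gap between
    consecutive parseable timestamps, or None if fewer than two parse."""
    stamps = []
    for line in lines:
        try:
            # Assumed prefix format, e.g. "Feb  3 16:56:01"; the year is
            # not in the log line, so it is passed in by the caller.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        stamps.append(stamp)

    best = None
    for before, after in zip(stamps, stamps[1:]):
        gap = (after - before).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best
```

Calling largest_gap(open("/var/log/syslog")) on each log in turn would show
whether all of them go quiet at the same moment. It only locates the silent
window and says nothing about the cause.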

I have reported a strange php/suhosin related error to the Ubuntu php team,
that is memory related
(https://bugs.launchpad.net/ubuntu/+source/php5/+bug/503396), and could be
related to this somehow. Possibly something happened there, and it was not
able to be logged. Hard to say. 

As another note, I have several systems running 8.04 now with very similar
configurations and these issues have not been a problem (except the previous
incarnation of this particular system).

Remote diagnostics after the problem started at approx 17:10:

$ ping www.example.net                              
PING www.example.net (212.253.111.163) 56(84) bytes of data.                 
64 bytes from www.example.net (212.253.111.163): icmp_seq=1 ttl=63 time=4.67 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=2 ttl=63 time=4.61 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=3 ttl=63 time=4.39 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=4 ttl=63 time=3.99 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=5 ttl=63 time=3.78 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=6 ttl=63 time=4.77 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=7 ttl=63 time=4.57 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=8 ttl=63 time=4.42 ms
^C                                                                                     
--- www.example.net ping statistics ---                                      
8 packets transmitted, 8 received, 0% packet loss, time 7007ms       


Starting Nmap 4.76 ( http://nmap.org ) at 2010-02-03 17:14 EST
Interesting ports on www.example.net (212.253.111.163):
Not shown: 994 closed ports                                    
PORT     STATE    SERVICE                                      
21/tcp   open     ftp                                          
22/tcp   open     ssh                                          
25/tcp   open     smtp                                         
80/tcp   open     http                                         
443/tcp  open     https                                        
1720/tcp filtered H.323/Q.931

Everything *looks* very normal at this point. But none of those servers fully
respond, and I can't open a usable connection. There is not even any indication
of attempted logins despite multiple attempts at new ssh connections. A
pre-existing ssh connection that had been open for weeks was likewise
totally unresponsive. The patient looks alive, but is quite dead.

wget -S www.example.net
--2010-02-03 17:16:08--  http://www.example.net/
Resolving www.example.net... 212.253.111.163
Connecting to www.example.net|212.253.111.163|:80... connected.
HTTP request sent, awaiting response... ^C

Hangs at that point. Same with ssh. All other systems in the same rack,
connected to the same switch, are 100% normal at this time too.
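
The pattern above (port open, connect succeeds, then nothing) can be checked
from a remote machine with a small probe. This is a minimal sketch, not a
definitive tool: the function name probe and its return labels are made up for
illustration. It relies on the fact that the kernel completes the TCP handshake
for a listening socket before the application calls accept(), so a "silent"
result suggests the kernel's network stack is alive while the daemon itself is
not getting to run.

```python
import socket


def probe(host, port, payload=b"", timeout=5.0):
    """Classify a TCP service by how far a connection attempt gets.

    Returns one of the illustrative labels: 'refused', 'unreachable',
    'silent', 'reset', 'closed', or 'responding'.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"        # nothing is listening on that port
    except OSError:
        return "unreachable"    # connect timed out or failed at the network level

    with sock:
        sock.settimeout(timeout)
        try:
            if payload:
                sock.sendall(payload)
            data = sock.recv(1024)
        except socket.timeout:
            return "silent"     # handshake completed, but no application data
        except OSError:
            return "reset"      # connection dropped mid-exchange
        return "responding" if data else "closed"
```

For example, probe("www.example.net", 22) waits for the ssh banner, and
probe("www.example.net", 80, b"HEAD / HTTP/1.0\r\n\r\n") waits for an HTTP
status line. Run periodically from another host, it would at least timestamp
the moment the services stop answering.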

Thanks.


-- 
Hal Burgiss
DBS>Interactive
Manager Technical Services



