Crash diagnostics help
hal at dbsinteractive.com
Thu Feb 4 14:38:43 UTC 2010
I am reporting this here, since I am not really sure what package to file this
against. Presumably this is a kernel problem, but it has some curious aspects
to it, and I thought I might try to get advice about where to go next.
Quick Summary: server crash
Current Hardware: Dell PowerEdge 2650
Platform: 8.04 (updated)
Kernel: Linux www 2.6.24-26-server #1 SMP Tue Dec 1 19:19:20 UTC 2009 i686
Profile: Web server with vhosted clients, and basic LAMP functionality.
Typical load: less than .20, rarely above .50
Symptom summary: System fails to fully respond. System is running, and
answers pings quite normally, but ALL servers fail to respond (apache, sshd,
etc), requiring a reboot to restore "normal" functionality.
Related log data: None.
I've run into a troubling situation that has followed me from one hardware
profile to something radically different, with the same nasty results. As
mentioned above this system supports several client web sites. Its main
purpose is Apache/php. Mysql is running on a separate system. ftp is installed
but firewalled and really not used. Mail is only there to relay out mail from
the vhosted web clients. No incoming mail.
What is most troubling is that 2 months ago we moved everything from a
completely different 8.04 system (an IBM x330 server) because of the same
problem, i.e., the system dies mysteriously with no log data, pings normally, nmap
shows all services running, but none of those services respond fully. I had
assumed we had some obscure hardware related problem, and moved all the
clients over to the current system. But something else is going on since the
problem has followed me to the current system.
The best I can get from the logs is that the last Apache request was served at
16:40. Syslogd left its ---MARK--- entry in syslog for the last time at
16:56, which is the last entry that I can find in any log, until a reboot at
17:33. Absolutely nothing unusual in syslog, kern.log, or any other log,
during any of this timeframe. Nothing real unusual in any Apache log either.
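One way to pin down a silent window like this is to scan a log for the largest gap between consecutive timestamps. The following is only a minimal sketch, not something that was run on this system. It assumes classic syslog lines that begin with "Mon DD HH:MM:SS", an English/C locale for month names, and a year supplied by the caller, since that format carries none. The function names are made up for illustration.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a classic syslog line.

    Returns None for lines that do not start with such a stamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (before, after) for the longest pause between stamped lines.

    Returns None if fewer than two lines carry a parseable timestamp.
    """
    best = None
    prev = None
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is None:
            continue  # skip continuation lines and anything unparseable
        if prev is not None and (best is None or ts - prev > best[1] - best[0]):
            best = (prev, ts)
        prev = ts
    return best
```

Called as largest_gap(open("/var/log/syslog"), 2010), this would be expected to report the pause between the last MARK entry and the first line written after the reboot, provided no longer pause exists elsewhere in the file.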
I have reported a strange php/suhosin related error to the Ubuntu php team,
that is memory related
(https://bugs.launchpad.net/ubuntu/+source/php5/+bug/503396), and could be
related to this somehow. Possibly something happened there, and it was not
able to be logged. Hard to say.
As another note, I have several systems running 8.04 now with very similar
configurations and these issues have not been a problem (except the previous
incarnation of this particular system).
Remote diagnostics after the problem started at approx 17:10:
$ ping www.example.net
PING www.example.net (184.108.40.206) 56(84) bytes of data.
64 bytes from www.example.net (220.127.116.11): icmp_seq=1 ttl=63 time=4.67 ms
64 bytes from www.example.net (18.104.22.168): icmp_seq=2 ttl=63 time=4.61 ms
64 bytes from www.example.net (22.214.171.124): icmp_seq=3 ttl=63 time=4.39 ms
64 bytes from www.example.net (126.96.36.199): icmp_seq=4 ttl=63 time=3.99 ms
64 bytes from www.example.net (188.8.131.52): icmp_seq=5 ttl=63 time=3.78 ms
64 bytes from www.example.net (184.108.40.206): icmp_seq=6 ttl=63 time=4.77 ms
64 bytes from www.example.net (220.127.116.11): icmp_seq=7 ttl=63 time=4.57 ms
64 bytes from www.example.net (18.104.22.168): icmp_seq=8 ttl=63 time=4.42 ms
--- www.example.net ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7007ms
Starting Nmap 4.76 ( http://nmap.org ) at 2010-02-03 17:14 EST
Interesting ports on www.example.net (22.214.171.124):
Not shown: 994 closed ports
PORT     STATE    SERVICE
21/tcp   open     ftp
22/tcp   open     ssh
25/tcp   open     smtp
80/tcp   open     http
443/tcp  open     https
1720/tcp filtered H.323/Q.931
Everything *looks* very normal at this point. But none of those servers fully
respond, and I can't open a usable connection. There is not even any indication
of attempted logins despite multiple attempts at new ssh connections. A
pre-existing ssh connection that had been open for weeks was likewise
totally unresponsive. The patient looks alive, but is quite dead.
$ wget -S www.example.net
--2010-02-03 17:16:08-- http://www.example.net/
Resolving www.example.net... 126.96.36.199
Connecting to www.example.net|188.8.131.52|:80... connected.
HTTP request sent, awaiting response... ^C
It hangs at that point. The same happens with ssh. All other systems in the
same rack, connected to the same switch, are 100% normal at this time too.
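The symptom above (the TCP connect succeeds, then nothing comes back) can be told apart from a closed or filtered port with a small probe. This is a minimal sketch under assumptions: the function name probe and its result labels are made up for illustration, and the 5-second default timeout is arbitrary.

```python
import socket


def probe(host, port, payload=b"", timeout=5.0):
    """Classify how far a TCP service gets: handshake only, or a real reply."""
    try:
        # The timeout set here also applies to the later send/recv calls.
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"                 # nothing is listening on the port
    except socket.timeout:
        return "connect-timeout"         # no answer to the connection attempt
    except OSError:
        return "error"                   # DNS failure, no route, etc.
    with sock:
        try:
            if payload:
                sock.sendall(payload)
            data = sock.recv(1024)
        except socket.timeout:
            return "connected-no-response"   # handshake done, service silent
        except OSError:
            return "reset"
    return "responded" if data else "closed-without-data"
```

For ssh, which sends its banner first, probe("www.example.net", 22) is enough. For HTTP, pass a request such as b"HEAD / HTTP/1.0\r\n\r\n" as the payload. A result of "connected-no-response" would be consistent with what wget showed: the connection is accepted at the TCP level, but no process ever answers.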
Manager Technical Services