Server stops responding

Hal Burgiss hal at burgiss.net
Sun Jul 26 13:38:28 UTC 2009


I have an issue with an 8.04 server that stops responding about once a
month. It doesn't "crash", really, it just stops responding.

Testing open ports:

$ nmap example.com

Starting nmap 3.70 ( http://www.insecure.org/nmap/ ) at 2009-07-26 08:33 EDT
Interesting ports on example.com:
(The 1655 ports scanned but not shown below are in state: closed)
PORT     STATE SERVICE
22/tcp   open  ssh
25/tcp   open  smtp
80/tcp   open  http
443/tcp  open  https
3306/tcp open  mysql

Looks good. The problem is that none of those will fully establish a
connection. An attempt to connect via ssh:

$ tcpdump -v host example.com
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

08:35:07.529666 IP (tos 0x0, ttl  64, id 63108, offset 0, flags [DF], proto 6,
length: 60) example2.com.48625 > example.com.ssh: S
[tcp sum ok] 365499356:365499356(0) win 5840 <mss 1460,sackOK,timestamp
3810846040 0,nop,wscale 2>

08:35:07.530225 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto 6,
length: 60) example.com.ssh > example2.com.48625: S
[tcp sum ok] 2913998847:2913998847(0) ack 365499357 win 5792 <mss
1460,sackOK,timestamp 143947824 3810846040,nop,wscale 6>

08:35:07.530281 IP (tos 0x0, ttl  64, id 63110, offset 0, flags [DF], proto 6,
length: 52) example2.com.48625 > example.com.ssh: .
[tcp sum ok] ack 1 win 1460 <nop,nop,timestamp 3810846041 143947824>

But it dies right there: the three-way handshake completes, then there is no
further response at all. Consistently. Ever. Until the reset button is hit.
Then it runs flawlessly for a month or so.
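
The state in the trace (handshake completes, then silence) can be told apart
from a closed or filtered port by connecting and waiting for the service
banner. A minimal sketch, assuming Python 3 on the client machine;
probe_banner and its return strings are names made up for this example, not
part of any tool:

```python
import socket

def probe_banner(host, port=22, timeout=5.0):
    """Connect, then wait for the first bytes the service sends.

    Returns "refused", "unreachable", "no-banner", "reset", "closed",
    or the banner text itself (an SSH server normally sends one first).
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"          # port closed: kernel answered with RST
    except OSError:
        return "unreachable"      # connect timed out, no route, DNS failure
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            return "no-banner"    # handshake done, service never spoke
        except OSError:
            return "reset"
    if not data:
        return "closed"           # peer closed without sending anything
    return data.decode("ascii", "replace").strip()
```

A "no-banner" result on port 22 would match the tcpdump output: the kernel
still completes the handshake, but the sshd process never gets to answer.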

Typically what I find if I dig through log files is that the system clock
seems to get weird. Example from just prior to the system going belly up:


65.55.110.76 - - [26/Jul/2009:06:51:08 -0400] "GET
/academic-programs/teacher-education/ba-elementary-p-5 HTTP/1.1" 200

65.55.110.76 - - [26/Jul/2009:06:51:08 -0400] "GET
/academic-programs/teacher-education/ba-elementary-p-5 HTTP/1.1" 200

123.149.115.33 - - [26/Jul/2009:06:41:34 -0400] "GET
/academic-programs/teacher-education/ HTTP/1.1" 404 -

123.149.115.33 - - [26/Jul/2009:06:41:34 -0400] "GET
/academic-programs/teacher-education/ HTTP/1.1" 404 - "-" "-"

74.6.22.182 - - [26/Jul/2009:07:45:07 -0400] "GET
/alumni_development/endowingCampaign.html HTTP/1.0" 404 20

74.6.22.182 - - [26/Jul/2009:07:45:07 -0400] "GET
/alumni_development/endowingCampaign.html HTTP/1.0" 404 20 "-" "Mozil

65.55.210.87 - - [26/Jul/2009:06:58:03 -0400] "GET
/future-students/grad/why-mc HTTP/1.1" 200 20

65.55.210.87 - - [26/Jul/2009:06:58:03 -0400] "GET
/future-students/grad/why-mc HTTP/1.1" 200 20 "-" "msnbot/1.1 (+http

74.6.22.182 - - [26/Jul/2009:07:45:08 -0400] "GET
/calendar/athletics/2009-07-02 HTTP/1.0" 404 20

74.6.22.182 - - [26/Jul/2009:07:45:08 -0400] "GET
/calendar/athletics/2009-07-02 HTTP/1.0" 404 20 "-" "Mozilla/5.0 (com

123.149.115.33 - - [26/Jul/2009:06:41:32 -0400] "GET
/academic-programs/academic-calendar/ HTTP/1.1" 404 -

123.149.115.33 - - [26/Jul/2009:06:41:32 -0400] "GET
/academic-programs/academic-calendar/ HTTP/1.1" 404 - "-" "-"

This is a pretty active site. The correct time was 6:41.
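
To find where the timestamps first go wrong, the access log can be scanned
for entries whose time steps backwards. A rough sketch, assuming Python 3 and
the usual Apache timestamp format shown above; clock_jumps and the 60 second
tolerance are arbitrary choices for illustration (Apache writes an entry when
the request finishes, so small reorderings are normal):

```python
import re
from datetime import datetime

# Matches the bracketed Apache timestamp, e.g. [26/Jul/2009:06:51:08 -0400]
STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def clock_jumps(lines, tolerance_s=60):
    """Yield (previous, current) timestamps where the logged time goes
    backwards by more than tolerance_s seconds."""
    prev = None
    for line in lines:
        m = STAMP.search(line)
        if not m:
            continue              # wrapped or non-matching line, skip it
        # %b expects English month names, i.e. a C or English locale
        cur = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if prev is not None and (prev - cur).total_seconds() > tolerance_s:
            yield prev, cur
        prev = cur
```

Usage would be along the lines of `for a, b in clock_jumps(open(path)):
print(a, "->", b)`, where path is wherever the access log lives on the
server.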

Typically there is nothing interesting in syslog, but this time there was a
bunch of oom-killer actions against apache processes at 7:45. That time is
wrong, and it comes after the weirdness started, so I don't know whether to
trust it, or whether it's an effect or a cause of another problem.

This server is headless in a datacenter, so I am limited in what I can do
remotely (especially if I can't connect).
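
Since nothing can be inspected once the box stops answering, one option is to
leave evidence on disk beforehand. The sketch below is only an illustration
under assumptions: Python 3 is available on the server, and the log path,
interval, and function names are invented for the example. It appends a
memory snapshot from /proc/meminfo once a minute and syncs it to disk, so the
last lines written before a hang survive the reset and show whether memory
was running out before the oom-killer entries:

```python
import os
import time

def parse_meminfo(text):
    """Turn the text of /proc/meminfo into a {field: number} dict."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def snapshot_line(info, now=None):
    """Format one log line; fields missing from info are shown as -1."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = ("MemFree", "Cached", "SwapFree", "Committed_AS")
    return stamp + " " + " ".join(
        "%s=%dkB" % (k, info.get(k, -1)) for k in fields)

def record(path="/var/tmp/memwatch.log", interval_s=60):
    """Append a snapshot every interval_s seconds (hypothetical path)."""
    while True:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        with open(path, "a") as out:
            out.write(snapshot_line(info) + "\n")
            out.flush()
            os.fsync(out.fileno())   # push to disk before a possible hang
        time.sleep(interval_s)
```

Calling record() from a background process would start the logging. The
timestamps come from the same system clock, so they may show the same jumps
as the Apache log; the order of the lines in the file is still reliable.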

Any ideas how to hunt this down?

-- 
Hal



