Server stops responding

CLIFFORD ILKAY clifford_ilkay at dinamis.com
Sun Jul 26 18:23:55 UTC 2009


On 26/07/09 09:38 AM, Hal Burgiss wrote:
> I have an issue with an 8.04 server, that about once a month, stops
> responding. It doesn't "crash", really, it just stops responding.
[snip]
> This is a pretty active site. The correct time was 6:41.
> 
> Typically there is not anything interesting in syslog, but this time there was
> a bunch of oom-killer actions against apache processes at 7:45. The time is wrong
> and after the weirdness started so I don't know whether to trust this. Or
> whether it's an effect or a cause of another problem.
> 
> This server is headless in a datacenter, so I am limited with what I can do
> remotely (especially if I can't connect).
> 
> Any ideas how to hunt this down?

With the sketchy information you've provided, I can only make educated
guesses. Here are the things we know. You're running Apache. Your
machine is becoming unresponsive at times. You're seeing "out of memory"
kill actions against Apache processes in your logs. Your site is "pretty
active".

Looking at your URLs, I'm guessing you're probably running some sort of
a database-backed CMS. If that is the case, I've seen similar problems
many times, particularly when the database in question is MySQL. MyISAM
uses table-level locks, so if your tables are MyISAM and the CMS is doing
frequent inserts or updates to one or more of them, you'll see a cascade
effect where new readers are blocked by earlier writers. As read requests
pile up waiting for MySQL table locks to be released, new requests keep
arriving, Apache forks more processes to handle them, and each of those
consumes a significant amount of RAM. The machine eventually exhausts
physical memory and then swap, at which point the kernel's oom-killer
starts killing processes, which would fit what you saw in syslog. Even
before that, once you hit swap, performance goes down the drain.
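
To put rough numbers on that cascade, here is a minimal Python sketch of
the arithmetic. All of the figures in it (RAM size, memory reserved for
MySQL and the OS, resident size per Apache process) are made-up
assumptions for illustration, not measurements from your server, and the
function names are hypothetical.

```python
def max_apache_processes(ram_mb, reserved_mb, per_process_mb):
    """Return how many Apache processes fit in RAM without touching swap."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    available_mb = ram_mb - reserved_mb
    return max(available_mb // per_process_mb, 0)


def memory_needed_mb(processes, per_process_mb, reserved_mb):
    """Return the total memory needed if `processes` Apache children are alive."""
    return processes * per_process_mb + reserved_mb


# Hypothetical figures: 2 GB of RAM, 512 MB kept for MySQL and the OS,
# and 40 MB resident per Apache process.
print(max_apache_processes(2048, 512, 40))  # 38
print(memory_needed_mb(150, 40, 512))       # 6512
```

With those assumed figures, anything beyond 38 simultaneous Apache
processes spills into swap, and a backlog of 150 blocked requests would
want more than 6 GB. Measure the real per-process size with "top" before
drawing conclusions; the result is a reasonable starting point for
Apache's MaxClients setting.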

While it would be useful to instrument the machine with something like
Munin <http://munin.projects.linpro.no/> to see what is going on, you
could also run "SHOW STATUS LIKE 'Table_locks%'" in MySQL and compare
"Table_locks_immediate" with "Table_locks_waited". Every increment of
"Table_locks_waited" means a query had to wait for a table lock to be
released, so if that counter is a significant fraction of the total, lock
contention is a likely culprit. If this is the issue, you can convert the
tables that are causing problems to InnoDB, which locks rows rather than
whole tables. Better yet, migrate to PostgreSQL, if you can. :)
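
If you want to turn those two counters into a single number, something
like the following Python sketch would do. It assumes you have captured
the output of

    mysql -B -e "SHOW GLOBAL STATUS LIKE 'Table_locks%'"

as text; the sample values below are invented for illustration, and the
function names are my own.

```python
def parse_status(text):
    """Parse 'name<whitespace>value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not "name number".
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def lock_wait_fraction(counters):
    """Return the fraction of table lock requests that had to wait."""
    immediate = counters.get("Table_locks_immediate", 0)
    waited = counters.get("Table_locks_waited", 0)
    total = immediate + waited
    return waited / total if total else 0.0


# Invented sample output, tab-separated as "mysql -B" prints it.
sample = (
    "Variable_name\tValue\n"
    "Table_locks_immediate\t950000\n"
    "Table_locks_waited\t50000\n"
)

print("%.1f%% of table locks waited" % (100 * lock_wait_fraction(parse_status(sample))))
```

There is no magic threshold, but the counters are cumulative since
MySQL started, so comparing two readings taken a few minutes apart
during a busy period tells you more than a single reading.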

By the way, throwing more hardware at the problem isn't always the
answer. We're helping a client sort out exactly this sort of problem on
a site running on a dual quad-core Xeon machine with 8GB of RAM. That's
an astounding amount of computing power for a CMS on a moderately busy
web site, so the problem should be resolvable without adding more
hardware.

When troubleshooting such problems, it helps to have a root shell open
to the remote machine. That way, you may be able to run "top" or "htop"
just as things start to go awry. At the very least, you could initiate a
restart from that shell so that you wouldn't have to power cycle the
machine. Power cycling should be avoided if at all possible.
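
Since you may not be watching "top" at the moment things go wrong, a
small logger that records load and free memory every minute can leave a
trail to read after the fact. This is only a sketch in Python, assuming
a Linux /proc filesystem; the log path in the usage comment and the
function names are hypothetical.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def snapshot(meminfo_path="/proc/meminfo"):
    """Return one log line with a timestamp, load averages and free memory."""
    with open(meminfo_path) as handle:
        mem = parse_meminfo(handle.read())
    load1, load5, load15 = os.getloadavg()
    return "%s load=%.2f/%.2f/%.2f MemFree=%dkB SwapFree=%dkB" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        load1, load5, load15,
        mem.get("MemFree", 0), mem.get("SwapFree", 0),
    )


def log_forever(log_path, interval_seconds=60):
    """Append a snapshot to log_path every interval_seconds until killed."""
    while True:
        # Reopen the file each time so every line is flushed to disk.
        with open(log_path, "a") as log:
            log.write(snapshot() + "\n")
        time.sleep(interval_seconds)


# Usage (hypothetical path): log_forever("/var/log/loadwatch.log")
```

If SwapFree shrinks steadily while the load average climbs in the
minutes before the machine goes quiet, that points at memory exhaustion
as the cause rather than a side effect.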

It would also make your life easier if you virtualized that physical
server because you would have a way of doing clean restarts on the
server without having to power cycle. We've had many cases where virtual
servers became unresponsive due to resource exhaustion and we simply
initiated a "restart" from the physical machine layer in Xen or OpenVZ.

Whether the "restart" is initiated at a root shell or from the physical
machine layer, don't be surprised if it takes a while for the machine to
actually respond. I've patiently waited half an hour sometimes. It's
quite likely that the load average on your machine when it becomes
unresponsive is through the roof so it will take a while to get the
attention of the CPU for it to initiate the shutdown.

Another optimization you could make is to replace Apache with nginx.
nginx is much gentler on memory than Apache, so it can service
significantly more concurrent requests on the same hardware. We usually
build it from source because the version in the Hardy repo is old.
-- 
Regards,

Clifford Ilkay
Dinamis
1419-3266 Yonge St.
Toronto, ON
Canada  M4N 3P6

<http://dinamis.com>
+1 416-410-3326


More information about the ubuntu-users mailing list