LTSP network freezing in Dapper

Gavin McCullagh gmccullagh at gmail.com
Mon Jul 31 19:41:38 BST 2006


Hi,

On Mon, 31 Jul 2006, Hendrik Boshoff wrote:

> Recently, the network started freezing randomly. The timing is
> unpredictable, but every client ceases getting response from the server
> at exactly the same point. Waiting for about three to five minutes
> usually allows the system to resolve the problem automatically. Neither
> the CPU nor the network load is heavy at the time, and the response on
> the server terminal is perfect. I noticed two other things: The lights on
> the network switch flash in unison at a rate of about 2 Hz under this
> condition, and a perl script with a name similar to network-conf floats
> to the top of the process list in terms of CPU time used. Cycling the
> power on the network switch seems to get the server to attend to the
> problem sooner, but I may be imagining this.

That perl script might be:

	/usr/share/setup-tool-backends/scripts/network-conf

which appears to be involved in configuring the network on an Ubuntu
machine.  I'm not sure what would cause this to be run at random.

Instinctively, this sounds more like a network problem than a server one.  I
may be wrong, but when I've seen a switch flashing like this, the cause has
sometimes been:

1. The switch crashing (which might explain your power cycle helping).  If
   this is a cheap, dumb (unmanaged) switch, your only fix might be to
   replace it.

2. Something nasty happening on the network.  A lot of repeated flashing
   might indicate a stream of broadcast packets (which go out on every
   port).  This might be a virus-ridden PC.  We once found a PC with a
   virus which did ARP poisoning, faking the MAC addresses of our most
   important machines (PDC, SDC, router).  This is very nasty, so I hope
   that's not your problem.
	http://en.wikipedia.org/wiki/ARP_spoofing

3. Of course the switch flashing might be because of the problem -- all of
   the machines are trying to connect to the server and it's not
   responding.

If your switch died, it would be normal enough for machines to spend some
time retrying before eventually giving up.  In the case of MAC (ARP)
spoofing, your machines would effectively be sending their traffic to the
wrong machine until they learned the right MAC address again.
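
One sign that is sometimes associated with ARP spoofing is a single MAC
address answering for several IP addresses.  As a rough sketch in Python
(the addresses are made up and the function name shared_macs is only for
illustration), you could flag such entries in a table of IP-to-MAC mappings.
A legitimate machine with several IP addresses will also show up, so treat a
hit as a hint rather than proof.

```python
from collections import defaultdict


def shared_macs(arp_table):
    """Return {mac: [ips]} for every MAC that is claimed by more than one IP.

    arp_table is a plain dict mapping IP address -> MAC address, however
    you collected it (hypothetical input for this sketch).
    """
    ips_by_mac = defaultdict(list)
    for ip, mac in arp_table.items():
        # Normalise case so 00:AA and 00:aa count as the same address.
        ips_by_mac[mac.lower()].append(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Made-up example: the "server" and the "router" both resolve to one MAC.
example = {
    "192.168.0.254": "00:16:3e:aa:bb:cc",   # server (hypothetical)
    "192.168.0.1": "00:16:3E:AA:BB:CC",     # router (hypothetical)
    "192.168.0.20": "00:16:3e:00:00:20",    # ordinary client
}
suspects = shared_macs(example)
```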

To try to diagnose it, I'd suggest:

a. Look in /var/log/syslog, /var/log/messages, /var/log/kern.log and
   /var/log/daemon.log on the server around the time this happened.  Check
   to see if there are any clues there.
b. Place a standalone machine on the network.  When you see the network
   die, try pinging the server from this machine to check whether the
   server is still reachable.
c. Note the MAC address of the server in advance (run /usr/sbin/arp on
   another machine that has recently talked to the server).  When the
   network goes down, check that the MAC address hasn't changed (an IP
   address conflict might also cause this).
d. Have a look at the memory usage on the server (the free command).  Just
   make sure it hasn't run out of memory and started killing processes
   (this should appear in the logs).
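
The check in (c) could be scripted along these lines.  This is only a sketch
in Python: it assumes a Linux machine where the ARP cache is readable from
/proc/net/arp, and the function names and the addresses in the usage note
are hypothetical.

```python
def parse_proc_net_arp(text):
    """Parse the contents of Linux's /proc/net/arp into {ip: mac}."""
    table = {}
    for line in text.splitlines()[1:]:      # first line is the column header
        fields = line.split()
        if len(fields) >= 4:
            # Columns: IP address, HW type, Flags, HW address, Mask, Device
            table[fields[0]] = fields[3].lower()
    return table


def check_server_mac(table, server_ip, expected_mac):
    """Compare the MAC currently seen for server_ip with the one noted earlier."""
    seen = table.get(server_ip)
    if seen is None:
        return "no ARP entry for " + server_ip
    if seen == expected_mac.lower():
        return "ok"
    return "MAC changed: expected %s, saw %s" % (expected_mac.lower(), seen)


def read_live_table(path="/proc/net/arp"):
    """Read the kernel's current ARP cache (Linux only)."""
    with open(path) as f:
        return parse_proc_net_arp(f.read())
```

On the standalone machine you might then call something like
check_server_mac(read_live_table(), "192.168.0.254", "00:16:3e:aa:bb:cc"),
substituting the server's real IP and the MAC address you noted while
everything was working.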

If you can't be on-site to see it, you could run "arpwatch" to monitor the
ARP (MAC address) entries and "smokeping" to monitor the ping times.
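
If installing those isn't convenient, a very crude stand-in for the ping
side is sketched below in Python.  It is not how smokeping works; it just
pings at a fixed interval and reports the periods when the server did not
answer.  The function names are made up, and the ping flags (-c for count,
-W for timeout in seconds) are the ones used by the Linux iputils ping.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Return True if a single ping to host gets a reply (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """Turn time-ordered (timestamp, ok) samples into a list of outages.

    Each outage is (first_failed_timestamp, first_good_timestamp_after),
    with None as the second value if the host was still down at the end.
    """
    outages = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            outages.append((start, ts))     # outage ends
            start = None
    if start is not None:
        outages.append((start, None))
    return outages


def monitor(host, interval_s=5, count=720):
    """Ping host `count` times, `interval_s` apart, and return the outages."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

With the defaults, monitor("192.168.0.254") (a hypothetical server address)
would watch for about an hour and return the start and end times of any
freezes, which you can then line up against the server logs.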

> There are sometimes problems as well with timeout of the NFS mount of
> /root during LTSP client booting (the second ok expected on the brown
> boot screen).

Does this happen at the same time as the freezes, or is it a separate
problem?

> When the network freeze happens during class time, he reboots the server,
> which temporarily helps. 
>  
> So what are my options? How do I troubleshoot LTSP networking? Do I
> reinstall on a third partition and be more conservative about changes to
> the default this time? I would not like to roll back postfix, the kids
> enjoy their email too much. I suppose I should get on IRC over the
> weekend for some assisted troubleshooting in real time. Any other
> suggestions?

I really can't see how it could have anything to do with Postfix.  If you
can be there when the problem happens, try the pings and check the ARP
table.  If you have a spare switch, you could swap it in for a week and see
whether the problem recurs.  It sounds like your server is dropping off the
network for some reason.

Let us know how you get on,

Gavin

More information about the edubuntu-users mailing list