Server load increasing due to a growing number of processes in D state

Alessandro Tagliapietra tagliapietra.alessandro at gmail.com
Mon Feb 25 15:39:18 UTC 2013


Hello Eduardo,  

I've rebooted the server and now it's running memtest, 40% done and no errors so far.

We have OpenStack (Essex) configured on two servers. The first one (this server) was a complete all-in-one install; I then set up the second as a compute node, adding nova-network in multi-host mode. Each server has a public internet interface, and the two are connected by a crossover cable on an additional NIC.

Each server has an Intel i7-2600 CPU, 16 GB of non-ECC RAM, and 2 x 3 TB HDDs in software RAID1 (mdadm).

At the moment I have about 8 KVM machines running on it and, yes, RabbitMQ runs on this node too, along with the MySQL server, nova-api, keystone and glance.
The kernel version is 3.2.0-37-generic. The OpenStack packages were updated to the latest version one week ago.

For the first 5-10 days of uptime it works fine, with no processes in D state for more than a few seconds (which I think is normal).
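
For anyone who wants to check the same thing, here is a minimal sketch of how processes in D state can be listed together with the kernel function they are waiting in. It is not the exact ps invocation behind the pastes quoted below (those arguments aren't shown in this thread), and the function name filter_d_state is made up for illustration. It assumes the procps version of ps, as shipped on Ubuntu.

```shell
#!/bin/sh
# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep). Expects input in the column order produced by:
#   ps -eo pid,stat,wchan:32,comm
# filter_d_state is a hypothetical helper name, not part of any package.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Typical use would be `ps -eo pid,stat,wchan:32,comm | filter_d_state`, run a few times some seconds apart: a process that shows up once and disappears is probably just waiting on I/O, while one that stays in the list with the same WCHAN is the interesting case.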

Once memtest has finished, I'll collect the kernel traces you asked for.
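
Roughly, the capture would be the three sysrq-t writes from your mail, wrapped in a small function so the timing stays consistent. This is only a sketch: the function name capture_sysrq_traces and its optional arguments are my own, and the defaults (3 dumps, 5 seconds apart, /proc/sysrq-trigger) simply follow the commands quoted below.

```shell
#!/bin/sh
# Write "t" to the sysrq trigger several times with a pause in between, so
# that successive task dumps can be compared to see whether tasks are moving.
# capture_sysrq_traces is a hypothetical helper; all arguments are optional:
#   $1  trigger file          (default /proc/sysrq-trigger, requires root)
#   $2  number of dumps       (default 3)
#   $3  seconds between dumps (default 5)
capture_sysrq_traces() {
    trigger=${1:-/proc/sysrq-trigger}
    count=${2:-3}
    interval=${3:-5}
    i=1
    while [ "$i" -le "$count" ]; do
        # Each write asks the kernel to dump all task states to its log.
        echo t > "$trigger" || return 1
        # No need to wait after the final dump.
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}
```

Run as root with no arguments, this should leave three task dumps in the kernel log, which can then be saved with something like `dmesg > sysrq-t.txt`. With this many blocked tasks the dmesg ring buffer may not hold everything, so the kernel log file written by syslog may be the more complete copy.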

By the way, I first ran into this problem some months ago, when byobu-status processes were being started many times by byobu and drove the load to more than 200. I then disabled byobu, but the issue showed up with other processes too (even an ls -la hangs sometimes).

I'll let you know the results.

Thank you very much for helping.

Best

--

Alessandro Tagliapietra  
alexfu.it (http://www.alexfu.it)  

On Monday, 25 February 2013, at 15:39, Eduardo Damato wrote:

> Hi Alessandro,
>  
> What's the node you're having problems with? Is this a compute node? Can you give more information on the layout of your nova installation? I can see that qemu and rabbit-mq are running on the same node. Do you use the compute node as an MQ node as well?
>  
> The problem seems to be more related to the kernel, since a large number of tasks are stuck in the same WCHAN.
>  
> Ideally, it would be good to have the output of sysrq-t from this system, but this can cause the system to hang or crash depending on its state, especially because we already know that many task_structs are blocked in the same place.
>  
> You could do:
>  
> # echo t > /proc/sysrq-trigger
> (wait 5 s)
> # echo t > /proc/sysrq-trigger
> (wait 5 s)
> # echo t > /proc/sysrq-trigger
>  
> And then we can have a look at the traces and see if they're moving or not.
>  
> lsof is blocked reading the memory maps of process 1227. This could lead to more information on the problem, but at the same time because there are so many blocked processes it could be just another sign of the problem and not a hint to the reason why this is happening.  
>  
> Without kernel traces (sysrq-t) or a vmcore it would be complicated to understand what's happening. It doesn't seem to be IO related.
>  
> Cheers,
> Eduardo.
>  
> On 25/02/13 12:10, Alessandro Tagliapietra wrote:
> > After running strace on lsof, I can see that it hangs here:
> >  
> > stat("/proc/1227/", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> > open("/proc/1227/stat", O_RDONLY) = 4
> > read(4, "1227 (nova-dhcpbridge) D 1224 25"..., 4096) = 242
> > close(4) = 0
> > readlink("/proc/1227/cwd", "/"..., 4096) = 1
> > stat("/proc/1227/cwd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > readlink("/proc/1227/root", "/", 4096) = 1
> > stat("/proc/1227/root", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > readlink("/proc/1227/exe", "/usr/bin/python2.7"..., 4096) = 18
> > stat("/proc/1227/exe", {st_mode=S_IFREG|0755, st_size=2989480, ...}) = 0
> > open("/proc/1227/maps", O_RDONLY) = 4
> > read(4,
> >  
> > Could it be a memory issue?
> > I can't run a memory test right now, maybe tomorrow. I'd just like to know whether anyone else has had the same issue.
> > Thanks in advance
> >  
> > --
> >  
> > Alessandro Tagliapietra  
> > alexfu.it (http://www.alexfu.it)  
> >  
> > On Monday, 25 February 2013, at 12:29, Alessandro Tagliapietra wrote:
> >  
> > > Hello guys,  
> > >  
> > > At work we have an OpenStack controller whose load, for some months now, starts climbing after a few days of uptime.
> > >  
> > > I've seen that the cause is processes that sometimes hang and remain in D state.
> > >  
> > > I've used some combination of ps args to get these outputs:  
> > >  
> > > http://pastebin.com/raw.php?i=LGGzGrWu  
> > > http://pastie.org/pastes/6332964/text
> > > http://pastie.org/pastes/6332979/text
> > >  
> > > The storage is a software RAID1 over 2 disks, whose SMART values are fine.
> > >  
> > > Commands like lsof, or strace on a process in D state, never return.
> > >  
> > > Any idea what could be the cause?
> > >  
> > > Thanks in advance  
> > >  
> > > --
> > >  
> > > Alessandro Tagliapietra  
> > > alexfu.it (http://www.alexfu.it)  
> >  
> >  
> >  
>  

