Fwd: [084/152] sched: Cure more NO_HZ load average woes

Stefan Bader stefan.bader at canonical.com
Tue Jan 25 11:08:59 UTC 2011


On 01/24/2011 04:00 PM, Chase Douglas wrote:
> On 01/24/2011 09:50 AM, Stefan Bader wrote:
>> On 01/06/2011 04:09 PM, Chase Douglas wrote:
>>> Hi all,
>>>
>>> I received this notification of a stable patch for .36 that should fix
>>> the load avg bugs once and for all. A recap:
>>>
>>> I found a bug in the load avg calculation and got a fix pushed upstream.
>>> This was thrown into lucid and maverick. Unfortunately, it caused a
>>> regression for our xen kernels, so it was removed from maverick ec2
>>> IIRC. Maybe from others too? This is the commit hash for ubuntu-maverick
>>> master:
>>>
>>> 74f5187ac873042f502227701ed1727e7c5fbfa9
>>>
>>> I believe this patch should be reenabled for all lucid and maverick
>>> kernels, and the following patch should be applied on top. I'm not sure
>>> how everything is falling out with the new stable queue process, so I'm
>>> forwarding this to the list just to be sure it's seen.
>>>
>>> Thanks!
>>>
>>
>> I must admit that the maths are a bit beyond my understanding. Though given that
>> the first half is in Maverick and the second went as a stable update for .36,
>> this seems to be the right thing to do.
>>
>> Acked-by: Stefan Bader <stefan.bader at canonical.com>
>>
>> This has now also been reported as a bug
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/706592
>>
>> Secondary question would be for Lucid. As the patch did not cleanly apply there,
>> I checked what currently is in Lucid and it seems the first part is
>> not there. But another patch:
>>
>> commit 0d843425672f4d2dc99b9004409aae503ef4d39f
>> Author: Chase Douglas <chase.douglas at canonical.com>
>> Date:   Thu Apr 8 12:02:11 2010 -0400
>>
>>     sched: update load count only once per cpu in 10 tick update window
>>
>> This does not show up upstream and I think I remember vaguely that this one was
>> replaced by the first upstream patch that is in Maverick.
>>
>> So I guess the action required there would be to revert the Ubuntu specific
>> patch and apply both halves of the upstream solution. Still it would be quite
>> nice to have some way of verification. Has anybody already some sort of testcase
>> for this?
> 
> I wrote a testcase for the original bug:
> 
> http://lkml.org/lkml/2010/3/29/170
> 
> As for the second bug, I'm not sure what a good testcase is.
> 
> -- Chase

I played around with the test case this morning. It seems to be ok for the
initial problem (task at 90% CPU and loadavg 0), but I am not sure about the
results with a Lucid kernel that has the old fix reverted and the two upstream
halves added. The load rises when running the test case, but never really
reaches 0.9, and even with only that test task running and nothing else going on
the 1min average moves between 0.8 and 0.5. That could be a misunderstanding on
my part of what the loadavg really means, or something else. Either way, it is
not giving me any real confidence in either direction at the moment.
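
For reference, here is a rough sketch of how I understand the 1min figure to be
derived: a fixed-point exponential average of the runnable count, sampled
roughly every 5 seconds. The constants (FSHIFT = 11, EXP_1 = 1884) and the
update step are modelled on calc_load() in the scheduler. Everything else is an
assumption for illustration only: simulate() is a made-up helper, and it
pretends the task is runnable at each sample with an independent probability
equal to its duty cycle, which is not how the real test case behaves. It is not
a diagnosis of the numbers above.

```python
import random

FSHIFT = 11               # bits of fixed-point precision
FIXED_1 = 1 << FSHIFT     # 1.0 in fixed point (2048)
EXP_1 = 1884              # 1/exp(5sec/1min) in fixed point


def calc_load(load, exp, active):
    """One update step of the exponentially decayed average (fixed point)."""
    load = load * exp + active * (FIXED_1 - exp)
    return load >> FSHIFT


def simulate(duty, samples, seed=0):
    """Hypothetical model: one task, runnable at each 5 s sample with
    probability `duty`. Returns the 1min average after every sample."""
    rng = random.Random(seed)
    load = 0
    history = []
    for _ in range(samples):
        nr_active = 1 if rng.random() < duty else 0
        load = calc_load(load, EXP_1, nr_active * FIXED_1)
        history.append(load / FIXED_1)
    return history


if __name__ == "__main__":
    # Print the simulated 1min average over ten minutes (120 samples).
    for i, value in enumerate(simulate(0.9, 120)):
        print(f"t={5 * (i + 1):4d}s  loadavg_1min={value:.2f}")
```

In this model every sample where the task happens to be asleep multiplies the
average by about 0.92, so some wobble below 0.9 is expected. A drop all the way
to 0.5 would need several such samples in a row, so this alone does not tell me
whether the numbers I am seeing are right.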

-Stefan



