[REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period

Chase Douglas chase.douglas at canonical.com
Thu Apr 1 19:56:20 UTC 2010


On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <akpm at linux-foundation.org> wrote:
> On Mon, 29 Mar 2010 09:41:12 -0400
> Chase Douglas <chase.douglas at canonical.com> wrote:
>
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers calc_load_tasks accounting after each cpu updates the
>> count until after the 10 tick period.
>>
>> BugLink: http://bugs.launchpad.net/bugs/513848
>>
>
> There was useful information in the [patch 0/1] email, such as the
> offending commit ID.  If possible, it's best to avoid the [patch 0/n]
> thing altogether - that information either has to be moved into the
> [patch 1/n] changelog by someone (ie: me), or it just gets ommitted and
> lost.

I'll be sure to fix that up in any future versions.

>
>> ---
>>  kernel/sched.c |   16 ++++++++++++++--
>>  1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 9ab3cd7..c0aedac 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -3064,7 +3064,8 @@ void calc_global_load(void)
>>   */
>>  static void calc_load_account_active(struct rq *this_rq)
>>  {
>> -     long nr_active, delta;
>> +     static atomic_long_t deferred;
>> +     long nr_active, delta, deferred_delta;
>>
>>       nr_active = this_rq->nr_running;
>>       nr_active += (long) this_rq->nr_uninterruptible;
>> @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
>>       if (nr_active != this_rq->calc_load_active) {
>>               delta = nr_active - this_rq->calc_load_active;
>>               this_rq->calc_load_active = nr_active;
>> +
>> +             /* Need to defer idle accounting during load update period: */
>> +             if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
>> +                          time_after_eq(jiffies, calc_load_update))) {
>> +                     atomic_long_add(delta, &deferred);
>> +                     return;
>> +             }
>
> That seems a sensible way to avoid the gross-looking "10 ticks" thing.
>
> What was the reason for "update the avenrun load estimates 10 ticks
> after the CPUs have updated calc_load_tasks"?  Can we do something
> smarter there to fix this?
>
>> +             deferred_delta = atomic_long_xchg(&deferred, 0);
>> +             delta += deferred_delta;
>> +
>>               atomic_long_add(delta, &calc_load_tasks);
>>       }
>>  }
>
> The global `deferred' is unfortunate from a design and possibly
> scalability POV.  Can it be moved into the `struct rq'?  That way it
> can become a plain old `unsigned long', too.

The reason it's global is that any given cpu may go NOHZ idle. If they
sleep through the next load accounting, then their deferred value
won't get accounted for. By keeping it global and having it checked by
every cpu, we ensure that in the worst case where only one cpu is
awake to do the accounting, the deferred values from all cpus are
taken into account.

After a review by Andy Whitcroft on the Ubuntu kernel team, we decided
to do a few things to improve performance:

1. Check if values are non-zero before atomically touching the
deferred variable (load avg accounting is imprecise anyways, we just
need to make sure we don't lose any tasks)
2. Rename the deferred variable to calc_load_tasks_deferred and move
it right after calc_load_tasks to hopefully catch it in the same cache
line

Thomas Gleixner seems to think this isn't the best approach (later
email in this thread), so I'm deferring sending it unless someone is
still interested in this approach.

-- Chase




More information about the kernel-team mailing list