[REGRESSION 2.6.30][PATCH v3] sched: update load count only once per cpu in 10 tick update window
chase.douglas at canonical.com
Mon Apr 19 20:16:42 UTC 2010
On Mon, Apr 19, 2010 at 11:52 AM, Peter Zijlstra <peterz at infradead.org> wrote:
> On Tue, 2010-04-13 at 16:19 -0700, Chase Douglas wrote:
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>> This change defers calc_load_tasks accounting after each cpu updates the
>> count until after the 10 tick update window.
>> A few points:
>> * A global atomic deferral counter, and not per-cpu vars, is needed
>> because a cpu may go NOHZ idle and not be able to update the global
>> calc_load_tasks variable for subsequent load calculations.
>> * It is not enough to add calls to account for the load when a cpu is
>> - Load avg calculation must be independent of cpu load.
>> - If a cpu is awakened by one task, but then has more scheduled before
>> the end of the update window, only the first task will be accounted.
> OK, so what you're saying is that because we update calc_load_tasks from
> entering idle, we decrease earlier than a regular 10 tick sample
> interval would?
> Hence you batch these early updates into _deferred and let the next 10
> tick sample roll them over?
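
For illustration only, the batching described above can be modeled in userspace C roughly as follows. This is a sketch and not the patch itself: the counter name calc_load_tasks_deferred, the helper functions, and the window_open flag are all assumptions made for the example, and C11 atomics stand in for the kernel's atomic_long_t operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel counters discussed above. */
static atomic_long calc_load_tasks;          /* global count sampled for the load avg */
static atomic_long calc_load_tasks_deferred; /* early updates batched during the window */

/* Regular per-cpu sample at the start of the window: applied directly. */
static void account_sample(long delta)
{
	atomic_fetch_add(&calc_load_tasks, delta);
}

/* Update from a cpu entering idle: batched if the window is still open. */
static void account_idle_entry(long delta, bool window_open)
{
	if (window_open)
		atomic_fetch_add(&calc_load_tasks_deferred, delta);
	else
		atomic_fetch_add(&calc_load_tasks, delta);
}

/* Once the window has closed, roll the batched updates into the global count. */
static void fold_deferred(void)
{
	long batched = atomic_exchange(&calc_load_tasks_deferred, 0);

	atomic_fetch_add(&calc_load_tasks, batched);
}

/* Two cpus with one task each; cpu 1 goes idle inside the window. */
static long demo_sampled_count(void)
{
	atomic_store(&calc_load_tasks, 0);
	atomic_store(&calc_load_tasks_deferred, 0);

	account_sample(1);            /* cpu 0 reports one active task */
	account_sample(1);            /* cpu 1 reports one active task */
	account_idle_entry(-1, true); /* cpu 1 idles before the window closes */

	long sampled = atomic_load(&calc_load_tasks); /* value the load avg would see */

	fold_deferred();              /* window over: the idle entry takes effect */
	assert(atomic_load(&calc_load_tasks) == 1);
	return sampled;
}
```

In this model the value sampled for the load avg still counts both tasks, because the idle-entry decrement only reaches the global counter once the window has closed.
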
> So the only early updates can come from
> pick_next_task_idle()->calc_load_account_active(), so why not specialize
> that callchain instead of the below?
> Also, since it's all NO_HZ, why not stick this in with the ILB? Once
> people get around to making that scale better, this can hitch a ride.
> Something like the below perhaps? It does run partially from softirq
> context, but since there's a distinct lack of synchronization here that
> didn't seem like an immediate problem.
I understand everything until you move the calc_load_account_active
call to run_rebalance_domains. I take it that when CPUs go NO_HZ idle,
at least one cpu is left to monitor and perform updates as necessary.
Conceptually, it makes sense that this cpu should be handling the load
accounting updates. However, I'm new to this code, so I'm having a
hard time understanding all the cases and timings for when the
scheduler softirq is called. Is it guaranteed to be called during
every 10 tick load update window? If not, then we'll have the issue
where a NO_HZ idle cpu won't be updated to 0 running tasks in time for
the load avg calculation.
Would someone be able to explain how the correct timing is guaranteed
for this path?
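
To make the timing question concrete, here is a small userspace C sketch of the property I am asking about. It is purely hypothetical: softirq_ran_in_window() and the array of tick numbers are invented for the example and do not correspond to anything in the scheduler. It only checks whether at least one scheduler softirq run falls inside a given 10 tick window.

```c
#include <stdbool.h>
#include <stddef.h>

#define LOAD_WINDOW_TICKS 10 /* length of the load update window */

/*
 * Hypothetical helper: returns true if any recorded softirq run falls in
 * [window_start, window_start + LOAD_WINDOW_TICKS).
 */
static bool softirq_ran_in_window(const unsigned long *softirq_ticks,
				  size_t n, unsigned long window_start)
{
	for (size_t i = 0; i < n; i++) {
		if (softirq_ticks[i] >= window_start &&
		    softirq_ticks[i] - window_start < LOAD_WINDOW_TICKS)
			return true;
	}
	return false;
}
```

If this can return false for some window, then under the proposed scheme a NO_HZ idle cpu's count would not be folded in before that window's load avg calculation.
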
I also have a concern with run_rebalance_domains: If the designated
no_hz.load_balancer cpu wasn't idle at the last tick or needs
rescheduling, load accounting won't occur for idle cpus. Is it
possible for this to occur every time when called in the 10 tick