[PATCH v3?] sched: update load count only once per cpu in 10 tick update window
Colin Ian King
colin.king at canonical.com
Fri Apr 9 16:29:47 UTC 2010
On Thu, 2010-04-08 at 12:02 -0400, Chase Douglas wrote:
> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
>
> This change defers calc_load_tasks accounting after each cpu updates the
> count until after the 10 tick update window.
>
> A few points:
>
> * A global atomic deferral counter, and not per-cpu vars, is needed
> because a cpu may go NOHZ idle and not be able to update the global
> calc_load_tasks variable for subsequent load calculations.
> * It is not enough to add calls to account for the load when a cpu is
> awakened:
> - Load avg calculation must be independent of cpu load.
> - If a cpu is awakend by one tasks, but then has more scheduled before
> the end of the update window, only the first task will be accounted.
>
> BugLink: http://bugs.launchpad.net/bugs/513848
>
> Signed-off-by: Chase Douglas <chase.douglas at canonical.com>
> ---
> kernel/sched.c | 24 ++++++++++++++++++++++--
> 1 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 81ede13..c372249 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2967,6 +2967,7 @@ unsigned long nr_iowait(void)
>
> /* Variables and functions for calc_load */
> static atomic_long_t calc_load_tasks;
> +static atomic_long_t calc_load_tasks_deferred;
> static unsigned long calc_load_update;
> unsigned long avenrun[3];
> EXPORT_SYMBOL(avenrun);
> @@ -3021,7 +3022,7 @@ void calc_global_load(void)
> */
> static void calc_load_account_active(struct rq *this_rq)
> {
> - long nr_active, delta;
> + long nr_active, delta, deferred;
>
> nr_active = this_rq->nr_running;
> nr_active += (long) this_rq->nr_uninterruptible;
> @@ -3029,6 +3030,25 @@ static void calc_load_account_active(struct rq *this_rq)
> if (nr_active != this_rq->calc_load_active) {
> delta = nr_active - this_rq->calc_load_active;
> this_rq->calc_load_active = nr_active;
> +
> + /*
> + * Update calc_load_tasks only once per cpu in 10 tick update
> + * window.
> + */
> + if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
> + time_after_eq(jiffies, calc_load_update))) {
> + if (delta)
> + atomic_long_add(delta,
> + &calc_load_tasks_deferred);
> + return;
> + }
> +
> + if (atomic_long_read(&calc_load_tasks_deferred)) {
> + deferred = atomic_long_xchg(&calc_load_tasks_deferred,
> + 0);
> + delta += deferred;
> + }
> +
> atomic_long_add(delta, &calc_load_tasks);
> }
> }
> @@ -3072,8 +3092,8 @@ static void update_cpu_load(struct rq *this_rq)
> }
>
> if (time_after_eq(jiffies, this_rq->calc_load_update)) {
> - this_rq->calc_load_update += LOAD_FREQ;
> calc_load_account_active(this_rq);
> + this_rq->calc_load_update += LOAD_FREQ;
> }
> }
OK - this does have some subtly to it that's for sure, especially with
the this_rg->calc_load_update dependency and hence the need to
re-ordering this after calc_load_account_active. One needs to tread
carefully here.
The description of the change and change needs some hard thought, but
I believe it does the trick as you claim.
Acked-by: Colin King <colin.king at canonical.com>
More information about the kernel-team
mailing list