[Karmic][PATCH v3] sched: update load count only once per cpu in 10 tick update window

Chase Douglas chase.douglas at canonical.com
Tue Apr 20 13:46:03 UTC 2010

There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up, calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

Once a cpu has updated the count, this change defers any further
calc_load_tasks accounting for that cpu until after the 10 tick update
window.

A few points:

* A global atomic deferral counter, and not per-cpu vars, is needed
  because a cpu may go NOHZ idle and not be able to update the global
  calc_load_tasks variable for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
  awakened:
  - Load avg calculation must be independent of cpu load.
  - If a cpu is awakened by one task, but then has more scheduled before
    the end of the update window, only the first task will be accounted.

BugLink: http://bugs.launchpad.net/bugs/513848

Stable Release Update Justification:

Impact of bug: On lightly loaded systems with specific workload
characteristics the load average may not be representative of the real
load. Given a specific test case it is possible to load a system with
NR_CPUS number of tasks and yet have a load avg of 0.00, though this is
unlikely to occur in real workloads.

How addressed: The attached patch reworks the load accounting mechanism
in the kernel scheduler. It ensures that the accounting is strictly
dependent on the time (i.e. snapshot taken every 5 seconds) and the
number of runnable and uninterruptible tasks at that given time.
Previously, the accounting also depended on whether a cpu went idle
shortly after the 5 second snapshot.

Reproduction: See attached reproduction test case. Run it once on a
non-loaded system (boot to rescue mode works well). Top will report the
cpu usage at 90%, but uptime will report a load avg near 0.00 instead of
at least 0.90 as expected.

Regression potential: The patch has been well received by senior
Ubuntu kernel team members and some of the upstream kernel maintainers
on lkml. For this reason it is assumed to be a good fix for this issue.
The only code path touched by this patch involves the load avg
accounting, so potential regressions could include incorrect load avg
and/or some unforeseen general bug like a null dereference. However, the
likelihood of either is minimal due to proper and thorough patch review.

Signed-off-by: Chase Douglas <chase.douglas at canonical.com>
Acked-by: Colin King <colin.king at canonical.com>
Acked-by: Andy Whitcroft <apw at canonical.com>
 kernel/sched.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 81ede13..c372249 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2967,6 +2967,7 @@ unsigned long nr_iowait(void)
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
@@ -3021,7 +3022,7 @@ void calc_global_load(void)
 static void calc_load_account_active(struct rq *this_rq)
-	long nr_active, delta;
+	long nr_active, delta, deferred;
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3029,6 +3030,25 @@ static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+		/*
+		 * Update calc_load_tasks only once per cpu in 10 tick update
+		 * window.
+		 */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			if (delta)
+				atomic_long_add(delta,
+						&calc_load_tasks_deferred);
+			return;
+		}
+		if (atomic_long_read(&calc_load_tasks_deferred)) {
+			deferred = atomic_long_xchg(&calc_load_tasks_deferred,
+						    0);
+			delta += deferred;
+		}
 		atomic_long_add(delta, &calc_load_tasks);
@@ -3072,8 +3092,8 @@ static void update_cpu_load(struct rq *this_rq)
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
+		this_rq->calc_load_update += LOAD_FREQ;
