[Xenial][SRU][CVE-2018-20784][PATCH 1/1] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c

Connor Kuehl connor.kuehl at canonical.com
Fri Sep 27 18:54:50 UTC 2019

From: Linus Torvalds <torvalds at linux-foundation.org>


Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list

Do a (manual) revert of:

  a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
is a surprising property that was not considered in followup commits
such as:

  9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b9c and
  cfs_rq throttling:

  Let take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
      one path because it has never been used and can't be throttled so
      tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled

   3) and we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
      cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list.
  So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
  cfs_rq is removed from the list.
  Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
  but tmp_alone_branch still points to TG3 cfs_rq because its throttled
  parent can't be enqueued when the lock is released.
  tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added,
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and
  propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b9c - which change was clearly
not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi at huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot at linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1 at huawei.com>
Reported-by: Sargun Dhillon <sargun at sargun.me>
Reported-by: Xie XiuQi <xiexiuqi at huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1 at huawei.com>
Tested-by: Sargun Dhillon <sargun at sargun.me>
Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot at linaro.org>
Cc: <stable at vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin at huawei.com>
Cc: Mike Galbraith <efault at gmx.de>
Cc: Peter Zijlstra <peterz at infradead.org>
Cc: Tejun Heo <tj at kernel.org>
Cc: Thomas Gleixner <tglx at linutronix.de>
Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo at kernel.org>
(backported from commit c40f7d74c741a907cfaeb73a7697081881c497d0)
[ Connor Kuehl: context adjustments were required to remove
  'cfs_rq_is_decayed()' and to merge the changes to 'update_blocked_averages()'.
  Also, I had to manually update old instances of 'for_each_leaf_cfs_rq_safe'
  to its successor which is introduced by this patch 'for_each_leaf_cfs_rq'. ]
Signed-off-by: Connor Kuehl <connor.kuehl at canonical.com>
 kernel/sched/fair.c | 44 ++++++++++----------------------------------
 1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ad5fa326de1..3433cbe3e96c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -313,10 +313,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
-	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
-				 leaf_cfs_rq_list)
+/* Iterate through all leaf cfs_rq's on a runqueue: */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -409,8 +408,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
-		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -6034,27 +6033,10 @@ static void attach_tasks(struct lb_env *env)
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-	if (cfs_rq->load.weight)
-		return false;
-	if (cfs_rq->avg.load_sum)
-		return false;
-	if (cfs_rq->avg.util_sum)
-		return false;
-	if (cfs_rq->runnable_load_sum)
-		return false;
-	return true;
-static void update_blocked_averages(int cpu)
+static inline bool cfs_rq_istatic void update_blocked_averages(int cpu)
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 	unsigned long flags;
 	raw_spin_lock_irqsave(&rq->lock, flags);
@@ -6064,7 +6046,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
-	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		/* throttled entities do not contribute to load */
 		if (throttled_hierarchy(cfs_rq))
@@ -6073,12 +6055,6 @@ static void update_blocked_averages(int cpu)
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
 			update_tg_load_avg(cfs_rq, 0);
-		/*
-		 * There can be a lot of idle CPU cgroups.  Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -8484,10 +8460,10 @@ const struct sched_class fair_sched_class = {
 void print_cfs_stats(struct seq_file *m, int cpu)
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
-	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);

More information about the kernel-team mailing list