[SRU][B][PATCH 1/1] sched/fair: Fix bandwidth timer clock drift condition

Khalid Elmously khalid.elmously at canonical.com
Fri Jun 28 15:30:55 UTC 2019


On 2019-06-28 17:16:20, Stefan Bader wrote:
> On 28.06.19 16:25, Connor Kuehl wrote:
> > On 6/28/19 6:09 AM, Stefan Bader wrote:
> >> On 21.06.19 20:18, Connor Kuehl wrote:
> >>> On 6/14/19 8:51 AM, Khalid Elmously wrote:
> >>>> From: Xunlei Pang <xlpang at linux.alibaba.com>
> >>>>
> >>>> BugLink: https://bugs.launchpad.net/bugs/1832151
> >>>>
> >>>> I noticed that cgroup task groups constantly get throttled even
> >>>> if they have low CPU usage. This causes jitter in the response
> >>>> time of some of our business containers when CPU quotas are enabled.
> >>>>
> >>>> It's very simple to reproduce:
> >>>>
> >>>>   mkdir /sys/fs/cgroup/cpu/test
> >>>>   cd /sys/fs/cgroup/cpu/test
> >>>>   echo 100000 > cpu.cfs_quota_us
> >>>>   echo $$ > tasks
> >>>>
> >>>> then repeat:
> >>>>
> >>>>   cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
> >>>>
> >>>> After some analysis, we found that cfs_rq::runtime_remaining will
> >>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
> >>>> "cfs_{b|q}->runtime_expires" values after the period timer is re-armed.
> >>>>
> >>>> The current condition used to judge clock drift in expire_cfs_rq_runtime()
> >>>> is wrong: the two runtime_expires are actually the same when clock
> >>>> drift happens, so this condition can never hit. The original design
> >>>> was done correctly by this commit:
> >>>>
> >>>>   a9cf55b28610 ("sched: Expire invalid runtime")
> >>>>
> >>>> ... but it was changed to the current implementation because of a
> >>>> locking bug in that original design.
> >>>>
> >>>> This patch introduces another way: it adds a new field to both struct
> >>>> cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
> >>>> and uses it to figure out whether clock drift has happened (true if the
> >>>> two counts are equal).
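> >>>>
> >>>> In short, the check in expire_cfs_rq_runtime() becomes the following
> >>>> (a condensed sketch; the diff below is the authoritative version, and
> >>>> the else branch keeps the function's pre-existing behaviour of zeroing
> >>>> the local runtime):
> >>>>
> >>>>   if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> >>>>           /* No refill since this cfs_rq cached its expiry, so the
> >>>>            * global deadline has not truly passed; the apparent
> >>>>            * expiry is local clock drift. Extend the local deadline
> >>>>            * (drift is bounded above by 2 ticks). */
> >>>>           cfs_rq->runtime_expires += TICK_NSEC;
> >>>>   } else {
> >>>>           /* The global deadline has advanced: runtime truly expired. */
> >>>>           cfs_rq->runtime_remaining = 0;
> >>>>   }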
> >>>>
> >>>> Signed-off-by: Xunlei Pang <xlpang at linux.alibaba.com>
> >>>> Signed-off-by: Peter Zijlstra (Intel) <peterz at infradead.org>
> >>>> Reviewed-by: Ben Segall <bsegall at google.com>
> >>>> Cc: Linus Torvalds <torvalds at linux-foundation.org>
> >>>> Cc: Peter Zijlstra <peterz at infradead.org>
> >>>> Cc: Thomas Gleixner <tglx at linutronix.de>
> >>>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
> >>>> Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
> >>>> Signed-off-by: Ingo Molnar <mingo at kernel.org>
> >>>> (backported from commit 512ac999d2755d2b7109e996a76b6fb8b888631d)
> >>>> [ kmously: Adjusted for different definitions of struct cfs_bandwidth and struct
> >>>>  cfs_rq ]
> >>>> Signed-off-by: Khalid Elmously <khalid.elmously at canonical.com>
> >>>
> >>> This looks good to me, but the bugzilla links to another patch, on its
> >>> way upstream, which claims to follow up on a regression introduced by
> >>> this patch. Should that patch also be included here? I only ask because
> >>> I'm not sure I have all the information/knowledge needed to form an
> >>> opinion on that follow-up patch.
> >>
> >> It would help if you supplied links to the patch or the relevant bugzilla comment.
> > 
> > Sorry. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198197
> > 
> > Links to this patch on LKML: https://lkml.org/lkml/2019/5/17/581
> 
> OK, so I am not sure whether the approach suggested by Dave Chiluk will be
> accepted upstream. The question is how bad the regression he describes is
> compared to what we would be fixing. And I am a bit doubtful that I am in a
> state of mind to decide right now.
> Khalid, as the one who supplied this, do you have a gut feeling about the
> importance of the fix?
> 

The fix is important for the customer. They had originally stated that it resolved the scheduling problems, but they later came back and said that it only "partially" resolved them.

Given that this fix is implicated in subsequent regressions, and given the change in the customer's feedback regarding it, I think it's best to hold off, so I will NACK this patch for now.

Thanks for catching that follow-up issue, Connor, and for the feedback, Stefan.

> -Stefan
> 
> > 
> >>
> >>>
> >>>> ---
> >>>>  kernel/sched/fair.c  | 14 ++++++++------
> >>>>  kernel/sched/sched.h |  6 +++++-
> >>>>  2 files changed, 13 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 61365fcbe148..2ec80e0822a5 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -4413,6 +4413,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >>>>  	now = sched_clock_cpu(smp_processor_id());
> >>>>  	cfs_b->runtime = cfs_b->quota;
> >>>>  	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> >>>> +	cfs_b->expires_seq++;
> >>>>  }
> >>>>  
> >>>>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> >>>> @@ -4435,6 +4436,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >>>>  	struct task_group *tg = cfs_rq->tg;
> >>>>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> >>>>  	u64 amount = 0, min_amount, expires;
> >>>> +	int expires_seq;
> >>>>  
> >>>>  	/* note: this is a positive sum as runtime_remaining <= 0 */
> >>>>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> >>>> @@ -4451,6 +4453,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >>>>  			cfs_b->idle = 0;
> >>>>  		}
> >>>>  	}
> >>>> +	expires_seq = cfs_b->expires_seq;
> >>>>  	expires = cfs_b->runtime_expires;
> >>>>  	raw_spin_unlock(&cfs_b->lock);
> >>>>  
> >>>> @@ -4460,8 +4463,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >>>>  	 * spread between our sched_clock and the one on which runtime was
> >>>>  	 * issued.
> >>>>  	 */
> >>>> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
> >>>> +	if (cfs_rq->expires_seq != expires_seq) {
> >>>> +		cfs_rq->expires_seq = expires_seq;
> >>>>  		cfs_rq->runtime_expires = expires;
> >>>> +	}
> >>>>  
> >>>>  	return cfs_rq->runtime_remaining > 0;
> >>>>  }
> >>>> @@ -4487,12 +4492,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >>>>  	 * has not truly expired.
> >>>>  	 *
> >>>>  	 * Fortunately we can check determine whether this the case by checking
> >>>> -	 * whether the global deadline has advanced. It is valid to compare
> >>>> -	 * cfs_b->runtime_expires without any locks since we only care about
> >>>> -	 * exact equality, so a partial write will still work.
> >>>> +	 * whether the global deadline(cfs_b->expires_seq) has advanced.
> >>>>  	 */
> >>>> -
> >>>> -	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> >>>> +	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> >>>>  		/* extend local deadline, drift is bounded above by 2 ticks */
> >>>>  		cfs_rq->runtime_expires += TICK_NSEC;
> >>>>  	} else {
> >>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>> index 41be9d48380f..3798f948477f 100644
> >>>> --- a/kernel/sched/sched.h
> >>>> +++ b/kernel/sched/sched.h
> >>>> @@ -280,8 +280,11 @@ struct cfs_bandwidth {
> >>>>  	u64 quota, runtime;
> >>>>  	s64 hierarchical_quota;
> >>>>  	u64 runtime_expires;
> >>>> +	int expires_seq;
> >>>>  
> >>>> -	int idle, period_active;
> >>>> +
> >>>> +	short idle;
> >>>> +	short period_active;
> >>>>  	struct hrtimer period_timer, slack_timer;
> >>>>  	struct list_head throttled_cfs_rq;
> >>>>  
> >>>> @@ -490,6 +493,7 @@ struct cfs_rq {
> >>>>  
> >>>>  #ifdef CONFIG_CFS_BANDWIDTH
> >>>>  	int runtime_enabled;
> >>>> +	int expires_seq;
> >>>>  	u64 runtime_expires;
> >>>>  	s64 runtime_remaining;
> >>>>  
> >>>>
