[SRU][PATCH 2/2][Bionic] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Juerg Haefliger juerg.haefliger at canonical.com
Mon Oct 21 09:04:10 UTC 2019


There is an additional follow-on fix for this commit:

commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3
Author: Qian Cai <cai at lca.pw>
Date:   Tue Aug 20 14:40:55 2019 -0400

    sched/fair: Fix -Wunused-but-set-variable warnings
    
    Commit:
    
       de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
    
    introduced a few compilation warnings:
    
      kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
      kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
      kernel/sched/fair.c: In function 'start_cfs_bandwidth':
      kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]
    
    Also, __refill_cfs_bandwidth_runtime() no longer updates the
    expiration time, so fix the comments accordingly.
    
    Signed-off-by: Qian Cai <cai at lca.pw>
    Signed-off-by: Peter Zijlstra (Intel) <peterz at infradead.org>
    Reviewed-by: Ben Segall <bsegall at google.com>
    Reviewed-by: Dave Chiluk <chiluk+linux at indeed.com>
    Cc: Linus Torvalds <torvalds at linux-foundation.org>
    Cc: Peter Zijlstra <peterz at infradead.org>
    Cc: Thomas Gleixner <tglx at linutronix.de>
    Cc: pauld at redhat.com
    Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
    Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
    Signed-off-by: Ingo Molnar <mingo at kernel.org>

It's benign and, from my limited testing, doesn't seem to be needed in
Bionic since we're compiling with -Wno-unused-but-set-variable, but it
certainly won't hurt to pull in as well. In general, when cherry-picking or
backporting patches, one should always check whether there are follow-on
fixes for those new commits.
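
For reference, here is a minimal, self-contained sketch of the pattern that
-Wunused-but-set-variable flags (illustrative only; the function and
variable names below are made up and are not the actual kernel code),
mirroring what is left of __refill_cfs_bandwidth_runtime() once the
expiration logic is removed:

  /* example.c -- compile with: gcc -Wall -c example.c */

  static unsigned long long fake_sched_clock(void)
  {
          return 0;       /* stand-in for sched_clock_cpu() */
  }

  void refill_runtime_example(unsigned long long *runtime,
                              unsigned long long quota)
  {
          unsigned long long now;

          now = fake_sched_clock();  /* 'now' is set here ...              */
          *runtime = quota;          /* ... but never read again, so gcc   */
                                     /* warns: variable 'now' set but not  */
                                     /* used [-Wunused-but-set-variable]   */
  }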

Patch coming up soon.

...Juerg


> From: Dave Chiluk <chiluk+linux at indeed.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1832151
> 
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
> 
> This has been root caused to cpu-local run queue being allocated per cpu
> bandwidth slices, and then not fully using that slice within the period.
> At which point the slice and quota expires. This expiration of unused
> slice results in applications not being able to utilize the quota for
> which they are allocated.
> 
> The non-expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this had been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
> 
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
> 
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
> 
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
> 
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
> 
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux at indeed.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz at infradead.org>
> Reviewed-by: Phil Auld <pauld at redhat.com>
> Reviewed-by: Ben Segall <bsegall at google.com>
> Cc: Ingo Molnar <mingo at redhat.com>
> Cc: John Hammond <jhammond at indeed.com>
> Cc: Jonathan Corbet <corbet at lwn.net>
> Cc: Kyle Anderson <kwa at yelp.com>
> Cc: Gabriel Munos <gmunoz at netflix.com>
> Cc: Peter Oskolkov <posk at posk.io>
> Cc: Cong Wang <xiyou.wangcong at gmail.com>
> Cc: Brendan Gregg <bgregg at netflix.com>
> Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
> (backported from commit de53fd7aedb100f03e5d2231cfce0e4993282425)
> [ kmously: A few context adjustments. Mostly because the patch expects some
> raw_spin_lock_irqsave() functions in fair.c but the bionic code still uses raw_spin_lock().
> Also 2 lines that needed to be removed from start_cfs_bandwidth() are non-existent in bionic.
> Furthermore, needed to adjust for different whitespace/context in the structs in sched.h ]
> Signed-off-by: Khalid Elmously <khalid.elmously at canonical.com>
> ---
>  Documentation/scheduler/sched-bwc.txt | 74 ++++++++++++++++++++++-----
>  kernel/sched/fair.c                   | 70 +++----------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 67 insertions(+), 81 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873f68ab..3397995168bb 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>  
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> -
> -A group's unused runtime is globally tracked, being refreshed with quota units
> -above at each period boundary.  As threads consume this bandwidth it is
> -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. That quota is assigned to per-cpu run queues in
> +slices as threads in the cgroup become runnable. Once all quota has been
> +assigned any additional requests for quota will result in those threads being
> +throttled. Throttled threads will not be able to run again until the next
> +period when the quota is replenished.
> +
> +A group's unassigned quota is globally tracked, being refreshed back to
> +cfs_quota units at each period boundary. As threads consume this bandwidth it
> +is transferred to cpu-local "silos" on a demand basis. The amount transferred
>  within each of these updates is tunable and described as the "slice".
>  
>  Management
> @@ -33,12 +34,12 @@ The default values are:
>  
>  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>  bandwidth restriction in place, such a group is described as an unconstrained
> -bandwidth group.  This represents the traditional work-conserving behavior for
> +bandwidth group. This represents the traditional work-conserving behavior for
>  CFS.
>  
>  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> -The minimum quota allowed for the quota or period is 1ms.  There is also an
> -upper bound on the period length of 1s.  Additional restrictions exist when
> +The minimum quota allowed for the quota or period is 1ms. There is also an
> +upper bound on the period length of 1s. Additional restrictions exist when
>  bandwidth limits are used in a hierarchical fashion, these are explained in
>  more detail below.
>  
> @@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
>  System wide settings
>  --------------------
>  For efficiency run-time is transferred between the global pool and CPU local
> -"silos" in a batch fashion.  This greatly reduces global accounting pressure
> -on large systems.  The amount transferred each time such an update is required
> +"silos" in a batch fashion. This greatly reduces global accounting pressure
> +on large systems. The amount transferred each time such an update is required
>  is described as the "slice".
>  
>  This is tunable via procfs:
> @@ -90,6 +91,51 @@ There are two ways in which a group may become throttled:
>  In case b) above, even though the child may have runtime remaining it will not
>  be allowed to until the parent's runtime is refreshed.
>  
> +CFS Bandwidth Quota Caveats
> +---------------------------
> +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
> +the slice may be returned to the global pool if all threads on that cpu become
> +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
> +variable. This is a performance tweak that helps prevent added contention on
> +the global lock.
> +
> +The fact that cpu-local slices do not expire results in some interesting corner
> +cases that should be understood.
> +
> +For cgroup cpu constrained applications that are cpu limited this is a
> +relatively moot point because they will naturally consume the entirety of their
> +quota as well as the entirety of each cpu-local slice in each period. As a
> +result it is expected that nr_periods roughly equal nr_throttled, and that
> +cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
> +
> +For highly-threaded, non-cpu bound applications this non-expiration nuance
> +allows applications to briefly burst past their quota limits by the amount of
> +unused slice on each cpu that the task group is running on (typically at most
> +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
> +applies if quota had been assigned to a cpu and then not fully used or returned
> +in previous periods. This burst amount will not be transferred between cores.
> +As a result, this mechanism still strictly limits the task group to quota
> +average usage, albeit over a longer time window than a single period.  This
> +also limits the burst ability to no more than 1ms per cpu.  This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultanously using less than
> +quota amounts of cpu. Another way to say this, is that by allowing the unused
> +portion of a slice to remain valid across periods we have decreased the
> +possibility of wastefully expiring quota on cpu-local silos that don't need a
> +full slice's amount of cpu time.
> +
> +The interaction between cpu-bound and non-cpu-bound-interactive applications
> +should also be considered, especially when single core usage hits 100%. If you
> +gave each of these applications half of a cpu-core and they both got scheduled
> +on the same CPU it is theoretically possible that the non-cpu bound application
> +will use up to 1ms additional quota in some periods, thereby preventing the
> +cpu-bound application from fully using its quota by that same amount. In these
> +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
> +decide which application is chosen to run, as they will both be runnable and
> +have remaining quota. This runtime discrepancy will be made up in the following
> +periods when the interactive application idles.
> +
>  Examples
>  --------
>  1. Limit a group to 1 CPU worth of runtime.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c3dfd4e28c46..b3e0159442d2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4444,8 +4444,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  
>  	now = sched_clock_cpu(smp_processor_id());
>  	cfs_b->runtime = cfs_b->quota;
> -	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4467,8 +4465,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> -	u64 amount = 0, min_amount, expires;
> -	int expires_seq;
> +	u64 amount = 0, min_amount;
>  
>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4485,61 +4482,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  			cfs_b->idle = 0;
>  		}
>  	}
> -	expires_seq = cfs_b->expires_seq;
> -	expires = cfs_b->runtime_expires;
>  	raw_spin_unlock(&cfs_b->lock);
>  
>  	cfs_rq->runtime_remaining += amount;
> -	/*
> -	 * we may have advanced our local expiration to account for allowed
> -	 * spread between our sched_clock and the one on which runtime was
> -	 * issued.
> -	 */
> -	if (cfs_rq->expires_seq != expires_seq) {
> -		cfs_rq->expires_seq = expires_seq;
> -		cfs_rq->runtime_expires = expires;
> -	}
>  
>  	return cfs_rq->runtime_remaining > 0;
>  }
>  
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> -	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> -		return;
> -
> -	if (cfs_rq->runtime_remaining < 0)
> -		return;
> -
> -	/*
> -	 * If the local deadline has passed we have to consider the
> -	 * possibility that our sched_clock is 'fast' and the global deadline
> -	 * has not truly expired.
> -	 *
> -	 * Fortunately we can check determine whether this the case by checking
> -	 * whether the global deadline(cfs_b->expires_seq) has advanced.
> -	 */
> -	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> -		/* extend local deadline, drift is bounded above by 2 ticks */
> -		cfs_rq->runtime_expires += TICK_NSEC;
> -	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> -	}
> -}
> -
>  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  {
>  	/* dock delta_exec before expiring quota (as it could span periods) */
>  	cfs_rq->runtime_remaining -= delta_exec;
> -	expire_cfs_rq_runtime(cfs_rq);
>  
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
> @@ -4725,8 +4678,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  		resched_curr(rq);
>  }
>  
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4751,7 +4703,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  		remaining -= runtime;
>  
>  		cfs_rq->runtime_remaining += runtime;
> -		cfs_rq->runtime_expires = expires;
>  
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4776,7 +4727,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
> -	u64 runtime, runtime_expires;
> +	u64 runtime;
>  	int throttled;
>  
>  	/* no need to continue the timer with no bandwidth constraint */
> @@ -4804,8 +4755,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  	/* account preceding periods in which throttling occurred */
>  	cfs_b->nr_throttled += overrun;
>  
> -	runtime_expires = cfs_b->runtime_expires;
> -
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
>  	 * we unthrottle. This can potentially race with an unthrottled group
> @@ -4818,8 +4767,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  		cfs_b->distribute_running = 1;
>  		raw_spin_unlock(&cfs_b->lock);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
> -		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +		runtime = distribute_cfs_runtime(cfs_b, runtime);
>  		raw_spin_lock(&cfs_b->lock);
>  
>  		cfs_b->distribute_running = 0;
> @@ -4896,8 +4844,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		return;
>  
>  	raw_spin_lock(&cfs_b->lock);
> -	if (cfs_b->quota != RUNTIME_INF &&
> -	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +	if (cfs_b->quota != RUNTIME_INF) {
>  		cfs_b->runtime += slack_runtime;
>  
>  		/* we are under rq->lock, defer unthrottling using a timer */
> @@ -4929,7 +4876,6 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> -	u64 expires;
>  
>  	/* confirm we're still not at a refresh boundary */
>  	raw_spin_lock(&cfs_b->lock);
> @@ -4946,7 +4892,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>  		runtime = cfs_b->runtime;
>  
> -	expires = cfs_b->runtime_expires;
>  	if (runtime)
>  		cfs_b->distribute_running = 1;
>  
> @@ -4955,11 +4900,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>  
>  	raw_spin_lock(&cfs_b->lock);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);
> +	lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
>  	raw_spin_unlock(&cfs_b->lock);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5e47c67382a1..40fb09c91dc3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -279,8 +279,6 @@ struct cfs_bandwidth {
>  	ktime_t period;
>  	u64 quota, runtime;
>  	s64 hierarchical_quota;
> -	u64 runtime_expires;
> -	int			expires_seq;
>  
>  	short			idle;
>  	short			period_active;
> @@ -494,8 +492,6 @@ struct cfs_rq {
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	int runtime_enabled;
> -	int			expires_seq;
> -	u64 runtime_expires;
>  	s64 runtime_remaining;
>  
>  	u64 throttled_clock, throttled_clock_task;
