APPLIED: [SRU][Bionic][PATCH 0/1] sched: Prevent CPU lockups when task groups take longer than the period

Tue Jul 23 05:02:37 UTC 2019

On 2019-07-18 15:05:58 , Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1836971
> 
> [Impact]
> 
> On machines with extremely high CPU usage, parent task groups which have a large
> number of children can make the for loop in sched_cfs_period_timer() run until 
> the watchdog fires when the cfs_period_us setting is too short.
> 
> In this particular case, it is unlikely that the call to hrtimer_forward_now()
> will return 0, meaning the for loop is never left, and tasks are never 
> rescheduled.
> 
> The large number of children makes do_sched_cfs_period_timer() take longer than 
> the period, which impacts calls to hrtimer_forward_now().
> 
> The kernel will produce this call trace:
> 
> CPU: 51 PID: 0 Comm: swapper/51 Tainted: P OELK 4.15.0-50-generic #54-Ubuntu
> 
> Call Trace:
> <IRQ>
> ? sched_clock+0x9/0x10
> walk_tg_tree_from+0x61/0xd0
> ? task_rq_unlock+0x30/0x30
> unthrottle_cfs_rq+0xcb/0x1a0
> distribute_cfs_runtime+0xd7/0x100
> sched_cfs_period_timer+0xd9/0x1a0
> ? sched_cfs_slack_timer+0xe0/0xe0
> __hrtimer_run_queues+0xdf/0x230
> hrtimer_interrupt+0xa0/0x1d0
> smp_apic_timer_interrupt+0x6f/0x130
> apic_timer_interrupt+0x84/0x90
> </IRQ> 
> 
> This has been hit in production in a particularly highly utilised hadoop cluster
> which is powering an analytics platform. About 30% of the cluster experiences
> this issue every week, and the machines need a manual reboot to get back online.
> 
> [Fix]
> 
> This was fixed in 5.1 upstream with the below commit:
> 
> commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8
> Author: Phil Auld <pauld at redhat.com>
> Date: Tue Mar 19 09:00:05 2019 -0400
> subject: sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup 
> 
> This commit adds a check to see if the loop has run too many times, and if it
> has, scales up the period and quota, so the timer can complete before the
> next period expires, which enables the task to be rescheduled normally.
> 
> Note, 2e8e19226398db8265a8e675fcc0118b9e80c9e8 was included in upstream stable
> versions 4.4.179, 4.9.171, 4.14.114, 4.19.37, 5.0.10.
> 
> This patch requires minor backporting for 4.15, so please cherry pick 
> d069fe4844f8d799d771659a745fe91870c93fda from upstream stable 4.14.y, where the 
> backport has been done by the original author, to all bionic kernels.
> 
> [Testcase]
> 
> Kind of hard to reproduce, so this was tested on a production hadoop cluster
> with extremely high CPU load.
> 
> I built a test kernel, which is available here:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf232784-test
> 
> For unpatched kernels, expect the machine to lockup and print the call trace in 
> the impact section.
> 
> For patched kernels, if the machine hits the condition, it will print a warning
> to the kernel log with the new period and quota which has been used:
> 
> Example from the same hadoop cluster with a machine running the test kernel:
> 
> % uname -a
> 4.15.0-50-generic #54+hf232784v20190626b1-Ubuntu 
> % sudo grep cfs /var/log/kern.log.*
> cfs_period_timer[cpu40]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
> cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 22430, cfs_quota_us = 1148437)
> cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 25759, cfs_quota_us = 1318908)
> cfs_period_timer[cpu68]: period too short, scaling up (new cfs_period_us 29583, cfs_quota_us = 1514684)
> cfs_period_timer[cpu49]: period too short, scaling up (new cfs_period_us 33974, cfs_quota_us = 1739519)
> cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 39017, cfs_quota_us = 1997729)
> cfs_period_timer[cpu10]: period too short, scaling up (new cfs_period_us 44809, cfs_quota_us = 2294267)
> cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 51460, cfs_quota_us = 2634823)
> cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 59099, cfs_quota_us = 3025929)
> cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
> 
> [Regression Potential]
> This patch was accepted into upstream stable versions 4.4.179, 4.9.171, 
> 4.14.114, 4.19.37, 5.0.10, and is thus treated as stable and trusted by the
> community.
> 
> Xenial received this patch in 4.4.0-150.176, as per LP #1828420
> Disco will receive this patch in the next version, as per LP #1830922
> Eoan already has the patch, being based on 5.2.
> 
> While this does effect a core part of the kernel, the scheduler, the patch has
> been extensively tested, and it has been proven in production environments,
> so the overall risk is low.
> 
> Phil Auld (1):
>   sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
> 
>  kernel/sched/fair.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> -- 
> 2.20.1
> 
> 
> -- 
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team