[SRU][J][PATCH v2 0/2] WARN in trc_wait_for_one_reader on Xen instances

Fri Nov 22 17:52:50 UTC 2024

BugLink: https://bugs.launchpad.net/bugs/2089373

[Impact]

When ending bpf tracing, 5.15 kernels now report a warning in
trc_wait_for_one_reader() on platforms that support hot-plugging CPUs,
but that do not have all of their hotplug slots populated.  In this
submitter's environment, it reproduces on Xen EC2 instances, but not
Nitro ones.

The warning looks like this:

kernel: [ 6416.920266] ------------[ cut here ]------------
kernel: [ 6416.920272] trc_wait_for_one_reader(): smp_call_function_single() failed for CPU: 64
kernel: [ 6416.920289] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:1044 trc_wait_for_one_reader+0x2b8/0x300
kernel: [ 6416.920299] Modules linked in: xt_state xt_connmark nf_conntrack_netlink nfnetlink xt_addrtype xt_statistic xt_nat xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nvidia_uvm(POE) nvidia_drm(POE) drm_kms_helper cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt nvidia_modeset(POE) nvidia(POE) iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark iptable_nat nf_nat bpfilter aufs overlay udp_diag tcp_diag inet_diag binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel input_leds psmouse crypto_simd cryptd serio_raw floppy sch_fq_codel nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ena drm efi_pstore ip_tables x_tables autofs4
kernel: [ 6416.920368] CPU: 0 PID: 13 Comm: rcu_tasks_trace Tainted: P OE 5.15.0-1071-aws #77~20.04.1-Ubuntu
kernel: [ 6416.920372] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
kernel: [ 6416.920374] RIP: 0010:trc_wait_for_one_reader+0x2b8/0x300
kernel: [ 6416.920376] Code: 00 00 00 4c 89 ef e8 37 ac 4e 00 eb 9f 44 89 fa 48 c7 c6 00 63 e2 b8 48 c7 c7 a0 9a 1e b9 c6 05 2f 2e 09 02 01 e8 15 2e b9 00 <0f> 0b e9 31 ff ff ff 4c 89 ee 48 c7 c7 20 df b7 b9 e8 a2 99 52 00
kernel: [ 6416.920380] RSP: 0018:ffff9e048c4efe00 EFLAGS: 00010286
kernel: [ 6416.920382] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
kernel: [ 6416.920384] RDX: 0000000000000027 RSI: 0000000000000003 RDI: ffff93074ae20588
kernel: [ 6416.920385] RBP: ffff9e048c4efe28 R08: ffff93074ae20580 R09: 0000000000000001
kernel: [ 6416.920387] R10: 0000000000ffff0a R11: ffff93463feb2c7f R12: ffff92cbc6a1e600
kernel: [ 6416.920389] R13: 0000000000000040 R14: 00000000000205a4 R15: 0000000000000040
kernel: [ 6416.920390] FS: 0000000000000000(0000) GS:ffff93074ae00000(0000) knlGS:0000000000000000
kernel: [ 6416.920393] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 6416.920394] CR2: 00007f4a72b04098 CR3: 00000046c8964001 CR4: 00000000001706f0
kernel: [ 6416.920399] Call Trace:
kernel: [ 6416.920401] <TASK>
kernel: [ 6416.920404] ? show_regs.cold+0x1a/0x1f
kernel: [ 6416.920410] ? trc_wait_for_one_reader+0x2b8/0x300
kernel: [ 6416.920412] ? __warn+0x8b/0xe0
kernel: [ 6416.920418] ? trc_wait_for_one_reader+0x2b8/0x300
kernel: [ 6416.920421] ? report_bug+0xd5/0x110
kernel: [ 6416.920427] ? handle_bug+0x39/0x90
kernel: [ 6416.920431] ? exc_invalid_op+0x19/0x70
kernel: [ 6416.920434] ? asm_exc_invalid_op+0x1b/0x20
kernel: [ 6416.920442] ? trc_wait_for_one_reader+0x2b8/0x300
kernel: [ 6416.920446] rcu_tasks_trace_postscan+0x47/0x80
kernel: [ 6416.920449] rcu_tasks_wait_gp+0x108/0x210
kernel: [ 6416.920453] rcu_tasks_kthread+0x10f/0x1c0
kernel: [ 6416.920456] ? wait_woken+0x60/0x60
kernel: [ 6416.920462] ? show_rcu_tasks_trace_gp_kthread+0x80/0x80
kernel: [ 6416.920464] kthread+0x12a/0x150
kernel: [ 6416.920471] ? set_kthread_struct+0x50/0x50
kernel: [ 6416.920476] ret_from_fork+0x22/0x30
kernel: [ 6416.920485] </TASK>
kernel: [ 6416.920486] ---[ end trace 0500611ddaff33a7 ]---

The problem appears when:

- The system is performing an rcu_tasks_trace grace period wait
- The system has more hotplug CPU slots available than are populated
- The rcu_tasks_trace postscan detects a holdout

The problem is actually caused by a mismerge of 9b3c4ab304 ("sched,rcu:
Rework try_invoke_on_locked_down_task()").  When that patch was applied,
a conflict around task nesting was improperly resolved, which led to
quiescent tasks getting flagged as holdouts.  This in turn results in
more IPIs than necessary being sent to idle CPUs, as well as WARNs about
failing to send IPIs to CPUs that aren't running.

The fix is a twofer: 1) manually correct the mismerge in the same way
that mainline resolved the conflict, and 2) backport an additional RCU
patch that confines the rcu_tasks postscan to only CPUs that are
running.

[Backport]

The upstream merge that shows the correct manual resolution of the merge
conflicts is in this commit:

   commit 6fedc28076bbbb32edb722e80f9406a3d1d668a8
   Merge tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

specifically:

 > @@ -951,18 +942,18 @@ static int trc_inspect_reader(struct task_struct *t, void *arg)
 >  		n_heavy_reader_updates++;
 >  		if (ofl)
 >  			n_heavy_reader_ofl_updates++;
 > -		in_qs = true;
 > +		nesting = 0;
 >  	} else {
 >  		// The task is not running, so C-language access is safe.
 > -		in_qs = likely(!t->trc_reader_nesting);
 > +		nesting = t->trc_reader_nesting;
 >  	}
 >  
 > -	// Mark as checked so that the grace-period kthread will
 > -	// remove it from the holdout list.
 > -	t->trc_reader_checked = true;
 > -
 > -	if (in_qs)
 > -		return 0;  // Already in quiescent state, done!!!
 > +	// If not exiting a read-side critical section, mark as checked
 > +	// so that the grace-period kthread will remove it from the
 > +	// holdout list.
 > +	t->trc_reader_checked = nesting >= 0;
 > +	if (nesting <= 0)
 > +		return nesting ? -EINVAL : 0;  // If in QS, done, otherwise try again later.
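
As a rough user-space illustration of the corrected hunk (not kernel
code), the sketch below maps a nesting value to the two outcomes the
hunk produces.  The struct, the function name, and the placeholder value
1 for the fall-through case are hypothetical; only the two expressions
on nesting are taken from the diff above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical container for the two outcomes of the corrected hunk. */
struct inspect_result {
	bool checked;	/* value the hunk stores in t->trc_reader_checked */
	int ret;	/* 0: in QS, done; -EINVAL: try again later */
};

/*
 * Model of the "+" lines in the hunk.  A return of 1 is a placeholder
 * for "nesting > 0": the real function keeps going past this point and
 * that later code is not modelled here.
 */
static struct inspect_result model_inspect(int nesting)
{
	struct inspect_result r;

	/* Mark as checked unless exiting a read-side critical section. */
	r.checked = nesting >= 0;

	if (nesting <= 0)
		r.ret = nesting ? -EINVAL : 0;
	else
		r.ret = 1;	/* placeholder, see comment above */

	return r;
}
```

With nesting == 0, the case the hunk uses for a currently running task
on the heavy-reader path, the model yields checked == true and a return
of 0, so the task is treated as quiescent rather than as a holdout.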

The additional rcu_tasks patch for only running postscan on online cpus
is:

   commit 5c9a9ca44fda41c5e82f50efced5297a9c19760d
   rcu-tasks: Idle tasks on offline CPUs are in quiescent states

I've additionally reached out to upstream about including this in
stable:

https://lore.kernel.org/stable/cover.1732237776.git.kjlx@templeofstupid.com/

[Test]

A trivial reproducer for this problem is to use an up-to-date version of
bpftrace to run a kfunc probe, which when destroyed uses the
rcu_tasks_trace facility to clean up:

   bpftrace -e 'kfunc:tcp_reset {@a = count();}'
   ^C

That is all that's necessary to reproduce the problem on a Xen EC2 system.

I've run with and without the patches applied and can confirm that
either patch alone, or both together, is sufficient to resolve the
problem.  Correcting the nesting ensures that idle CPUs don't get
flagged as holdouts, and confining the scan to just online CPUs ensures
that even if we incorrectly flag a CPU as a holdout the warning won't
trigger, because sending the IPI won't fail.
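
The effect of confining the scan can be pictured with a small user-space
toy model.  This is only a sketch under stated assumptions: the slot
count, the online map, and the function names are hypothetical, and the
model simply assumes that an IPI to a CPU that is not online fails with
-ENXIO.  It does not reproduce the kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>

#define NR_POSSIBLE 8	/* hypothetical number of hotplug slots */

/* Hypothetical layout: slots 0-3 populated and online, 4-7 empty. */
static const bool cpu_online_map[NR_POSSIBLE] = { true, true, true, true };

/* Model of an IPI send: assumed to fail when the target is not online. */
static int model_send_ipi(int cpu)
{
	if (cpu < 0 || cpu >= NR_POSSIBLE || !cpu_online_map[cpu])
		return -ENXIO;
	return 0;
}

/*
 * Worst case for the scan: every idle task visited is (incorrectly)
 * treated as a holdout and gets an IPI.  Returns how many sends fail,
 * which in this model corresponds to how often the WARN could fire.
 */
static int count_failed_ipis(bool online_only)
{
	int failures = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE; cpu++) {
		if (online_only && !cpu_online_map[cpu])
			continue;	/* scan confined to online CPUs */
		if (model_send_ipi(cpu))
			failures++;
	}
	return failures;
}
```

In this model, scanning all possible slots produces four failed sends
(one per unpopulated slot), while scanning only online CPUs produces
none, even though every visited CPU is still treated as a holdout.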

[Potential Regression]

The regression potential is low.  The corrected commit has been present
in mainline since 2021 and the fix to only run postscan on online CPUs
has been present since 2022.

Krister Johansen (1):
  UBUNTU: SAUCE: rcu-tasks: fix mismerge in trc_inspect_reader

Paul E. McKenney (1):
  rcu-tasks: Idle tasks on offline CPUs are in quiescent states

 kernel/rcu/tasks.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.25.1