Fwd: What's soft lockup?

孙世龙 sunshilong sunshilong369 at gmail.com
Tue Jul 7 06:16:00 UTC 2020


Thank you for your detailed explanation.

>> Here is the most related log:
>> Jul  3 10:24:35 yx kernel: [ 1240.166063] watchdog: BUG: soft lockup -
>> CPU#0 stuck for 22s! [rt_cansend:1837]
>> ...
>> Jul  3 10:24:35 yx kernel: [ 1240.166257] Call Trace:
>> Jul  3 10:24:35 yx kernel: [ 1240.166266]  _raw_spin_lock+0x20/0x30
>> Jul  3 10:24:35 yx kernel: [ 1240.166275]  can_write+0x6c/0x2c0 [advcan]
>> Jul  3 10:24:35 yx kernel: [ 1240.166281]  ? dequeue_signal+0xae/0x1a0
>> Jul  3 10:24:35 yx kernel: [ 1240.166286]  ? recalc_sigpending+0x1b/0x50
>> Jul  3 10:24:35 yx kernel: [ 1240.166291]  ? __set_task_blocked+0x3c/0xa0
>> Jul  3 10:24:35 yx kernel: [ 1240.166297]  __vfs_write+0x3a/0x190
>> Jul  3 10:24:35 yx kernel: [ 1240.166303]  ? apparmor_file_permission+0x1a/0x20
>> Jul  3 10:24:35 yx kernel: [ 1240.166307]  ? security_file_permission+0x3b/0xc0
>> Jul  3 10:24:35 yx kernel: [ 1240.166312]  vfs_write+0xb8/0x1b0
>> Jul  3 10:24:35 yx kernel: [ 1240.166316]  ksys_write+0x5c/0xe0
>> Jul  3 10:24:35 yx kernel: [ 1240.166321]  __x64_sys_write+0x1a/0x20
>> Jul  3 10:24:35 yx kernel: [ 1240.166326]  do_syscall_64+0x87/0x250
>> Jul  3 10:24:35 yx kernel: [ 1240.166331]
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>This stack trace tells us that the kernel is trying to write through the
>VFS layer and has queued the write while waiting for a lock. This lock is
>likely taken by some other kernel code.
I have two questions.
First one:
What is the relation between "RIP: 0010:queued_spin_lock_slowpath+
0xd9/0x1a0" and "Call Trace: _raw_spin_lock+0x20/0x30"?
Does __raw_spin_lock() invoke queued_spin_lock_slowpath()?
I carefully read the related code snippets, but I don't think I can
draw such a conclusion.
What do you think about it?
For your convenience, here are the URLs of the related code snippets:
https://elixir.bootlin.com/linux/v4.19.84/source/include/linux/spinlock_api_smp.h#L139

https://elixir.bootlin.com/linux/v4.19.84/source/kernel/locking/qspinlock.c#L282

Second one:
Why does the kernel queue the write while waiting for a lock? Under what
conditions is such a queue operation taken?

>Affected task is: PID 1837 and the task is rt_cansend. Now.. it is
>possible that the VFS write (a write to a filesystem) has NOT come
>directly from this PID. It's part of the complexity of the kernel. If
>the number of dirty pages is above a certain watermark, it *may* use
>other tasks' %sys% time to flush dirty pages, for example.
>So it may, or may not, come from this task.
I think I understand what you mean. Having carefully read the related
code, I believe the said "VFS write" is triggered when the user
application calls write() to send data over the CAN bus.
So I think it does come from this task (the rt_cansend process, whether
thread 1837 or another thread of the same process) itself.

>> Jul  3 10:21:11 yx kernel: [ 1035.655974] CAN[0][0] RX: FIFO overrun
>> Jul  3 10:21:54 yx kernel: [ 1078.664116] CAN[0][0] RX: FIFO overrun
>> Jul  3 10:22:36 yx kernel: [ 1120.688506] CAN[0][0] RX: FIFO overrun
>
>This is likely related to the UART (serial controller) driver buffer and
>it is likely that you can ignore these errors.
These are error messages from the CAN driver, not the UART.

>This is telling us that the instruction pointer (of the CPU) is in a
>function called queued_spin_lock_slowpath and we're still waiting to
>acquire the lock (spinlock) that I mentioned before.
As in the first question above: what is the relation between the
queued_spin_lock_slowpath() function and _raw_spin_lock()?

>Then we have more of the same AND... RCU code complaining that the CPU
>0 was stalled. Then you had a NMI (non maskable interrupt) from the
>RCU watchdog so the kernel could print the stack trace:
The stack is printed at an "almost" random time, but the context of the
stacks is always the same, so I think it must be related to some
spinlock. As the stack indicates, I believe it is the spinlock taken by
the can_write() function.
What do you think about it?

>Xenomai is a realtime kernel attempt so.. maybe the realtime part of
>the xenomai enabled kernel triggered a real VFS dead lock (spinlock)
>scenario =).
That's a good point.

>So check which files rt_cansend are writing to, and to which filesystem
>and it *may* give you an answer of who is responsible for the VFS dead
>lock.
According to the source code of rt_cansend.c, it only opens the CAN
device node (i.e. something like /dev/can).

>Your kernel is deadlocked in the VFS layer, trying to write to the
>filesystem from 2 PIDs:
>and both PIDs are spinning on CPUs trying to acquire a spinlock that
>has LIKELY been acquired by a task that has slept (been scheduled off
>the CPU) and may never come back to the CPUs to unlock it, because
>other tasks are spinning (and the condition to release the lock won't
>happen).
The rt_cansend process consists of several threads (including 1835 and 1837).

>I hope I helped you somehow... talk to your kernel supporter, open a
>support case or try other kernels to check if the issue happens. You
>can always bisect kernel and find the offending commit IF you have a
>100% accurate reproducer.
I have all the source code used to build the kernel (I built the kernel
myself), and I can easily reproduce this error.
But there is a huge problem: there are so many commits in the Linux
kernel that I don't know how to bisect it. Could you give me some
suggestions?
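From what I have read so far, git bisect does a binary search over the commit
history. Is something like the following toy demo the right idea? It is a
self-contained throwaway repository (not a kernel tree); with the real kernel
I suppose one would build, install, and boot each candidate commit and run the
reproducer, instead of grepping a file:

```shell
# Toy repo: 5 commits, where the "bug" first appears in commit 4.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5; do
    echo "$i" > file
    [ "$i" -ge 4 ] && echo "bug" >> file
    git add file
    git commit -qm "commit $i"
done

# Binary search between the first commit (known good) and HEAD (known bad).
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"

# 'git bisect run' automates the good/bad decision: the command must exit 0
# for a good commit and nonzero (1-127, except 125) for a bad one. For a
# kernel you would instead build, boot, and run the reproducer here.
git bisect run sh -c '! grep -q bug file'

# git has now printed "<sha> is the first bad commit"; go back to HEAD.
git bisect reset
```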

Thank you for your attention to this matter.

孙世龙 sunshilong <sunshilong369 at gmail.com> wrote on Tue, Jul 7, 2020 at 9:45 AM:
>Oh, and yes the VFS writes came from rt_cansend tasks as the stack
>shows the context switching for the system call from the beginning. So
>check which files rt_cansend are writing to, and to which filesystem
>and it *may* give you an answer of who is responsible for the VFS dead
>lock.
>
>The system call write() calls the VFS of the filesystem of the file
>being described by the file descriptor opened by the task. So
>rt_cansend has opened file descriptors to files it is writing to.
>Depending on the filesystem those files (inodes) are placed into, the
>corresponding VFS functions (for that filesystem) will be called.
>
>Okay, thats it.. g'luck =)
>
>On Tue, Jul 7, 2020 at 12:34 AM Rafael David Tinoco
><rafaeldtinoco at ubuntu.com> wrote:
>>
>> > Here is the most related log:
>> > Jul  3 10:24:35 yx kernel: [ 1240.166063] watchdog: BUG: soft lockup -
>> > CPU#0 stuck for 22s! [rt_cansend:1837]
>> > ...
>> > Jul  3 10:24:35 yx kernel: [ 1240.166257] Call Trace:
>> > Jul  3 10:24:35 yx kernel: [ 1240.166266]  _raw_spin_lock+0x20/0x30
>> > Jul  3 10:24:35 yx kernel: [ 1240.166275]  can_write+0x6c/0x2c0 [advcan]
>> > Jul  3 10:24:35 yx kernel: [ 1240.166281]  ? dequeue_signal+0xae/0x1a0
>> > Jul  3 10:24:35 yx kernel: [ 1240.166286]  ? recalc_sigpending+0x1b/0x50
>> > Jul  3 10:24:35 yx kernel: [ 1240.166291]  ? __set_task_blocked+0x3c/0xa0
>> > Jul  3 10:24:35 yx kernel: [ 1240.166297]  __vfs_write+0x3a/0x190
>> > Jul  3 10:24:35 yx kernel: [ 1240.166303]  ? apparmor_file_permission+0x1a/0x20
>> > Jul  3 10:24:35 yx kernel: [ 1240.166307]  ? security_file_permission+0x3b/0xc0
>> > Jul  3 10:24:35 yx kernel: [ 1240.166312]  vfs_write+0xb8/0x1b0
>> > Jul  3 10:24:35 yx kernel: [ 1240.166316]  ksys_write+0x5c/0xe0
>> > Jul  3 10:24:35 yx kernel: [ 1240.166321]  __x64_sys_write+0x1a/0x20
>> > Jul  3 10:24:35 yx kernel: [ 1240.166326]  do_syscall_64+0x87/0x250
>> > Jul  3 10:24:35 yx kernel: [ 1240.166331]
>> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> This stack trace tells us that the kernel is trying to write through
>> the VFS layer and has queued the write while waiting for a lock. This
>> lock is likely taken by some other kernel code.
>>
>> Affected task is: PID 1837 and the task is rt_cansend. Now.. it is
>> possible that the VFS write (a write to a filesystem) has NOT come
>> directly from this PID. It's part of the complexity of the kernel. If
>> the number of dirty pages is above a certain watermark, it *may* use
>> other tasks' %sys% time to flush dirty pages, for example.
>>
>> So it may, or may not, come from this task.
>>
>> Note: I don't know much about xenomai and CAN messages.
>>
>> > Here is the full log:
>> > Jul  3 10:06:16 yx kernel: [  140.313856] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:06:59 yx kernel: [  183.323792] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:07:42 yx kernel: [  226.329465] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:08:24 yx kernel: [  268.362822] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:09:07 yx kernel: [  311.372488] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:09:50 yx kernel: [  354.377996] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:10:32 yx kernel: [  396.411726] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:11:15 yx kernel: [  439.421156] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:11:58 yx kernel: [  482.426522] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:12:40 yx kernel: [  524.460688] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:13:23 yx kernel: [  567.469857] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:14:06 yx kernel: [  610.475021] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:14:48 yx kernel: [  652.509597] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:15:31 yx kernel: [  695.518491] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:16:14 yx kernel: [  738.523551] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:16:55 yx kernel: [  779.558139] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:17:38 yx kernel: [  822.566773] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:18:21 yx kernel: [  865.571697] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:19:03 yx kernel: [  907.607049] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:19:46 yx kernel: [  950.615449] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:20:29 yx kernel: [  993.620196] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:21:11 yx kernel: [ 1035.655974] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:21:54 yx kernel: [ 1078.664116] CAN[0][0] RX: FIFO overrun
>> > Jul  3 10:22:36 yx kernel: [ 1120.688506] CAN[0][0] RX: FIFO overrun
>>
>> This is likely related to the UART (serial controller) driver buffer
>> and it is likely that you can ignore these errors.
>>
>> > Jul  3 10:23:31 yx kernel: [ 1176.166058] watchdog: BUG: soft lockup -
>> > CPU#0 stuck for 22s! [rt_cansend:1837]
>> > Jul  3 10:23:31 yx kernel: [ 1176.166066] Modules linked in: bnep
>> > snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>> > nls_iso8859_1 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep
>> > snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl
>> > intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp
>> > kvm_intel snd_seq punit_atom_debug crct10dif_pclmul crc32_pclmul
>> > ghash_clmulni_intel snd_seq_device cryptd intel_cstate snd_timer
>> > hci_uart snd lpc_ich advcan(OE) mei_txe btqca soundcore mei btbcm
>> > btintel bluetooth ecdh_generic rfkill_gpio pwm_lpss_platform mac_hid
>> > pwm_lpss parport_pc ppdev lp parport autofs4 i915 kvmgt vfio_mdev mdev
>> > vfio_iommu_type1 vfio kvm irqbypass drm_kms_helper syscopyarea
>> > sysfillrect sysimgblt fb_sys_fops igb drm dca ahci i2c_algo_bit
>> > libahci video i2c_hid hid
>> > Jul  3 10:23:31 yx kernel: [ 1176.166204] CPU: 0 PID: 1837 Comm:
>> > rt_cansend Tainted: G           OE     4.19.84-solve-alc-failure #1
>> > Jul  3 10:23:31 yx kernel: [ 1176.166209] I-pipe domain: Linux
>> > Jul  3 10:23:31 yx kernel: [ 1176.166218] RIP:
>> > 0010:queued_spin_lock_slowpath+0xd9/0x1a0
>>
>> This is telling us that the instruction pointer (of the CPU) is in a
>> function called queued_spin_lock_slowpath and we're still waiting to
>> acquire the lock (spinlock) that I mentioned before.
>>
>> Note: as this is a spin lock, it's likely that the task has been
>> consuming the CPU, spinning in an infinite loop trying to get the
>> lock, without being scheduled in and out of the CPU => this explains
>> why your system is in a bad state.
>>
>> > Jul  3 10:23:31 yx kernel: [ 1176.166223] Code: 48 03 34 c5 00 67 37
>> > 91 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
>> > 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66
>> > 45 31 c0 41 39 c8 0f 84 96 00 00
>> > Jul  3 10:23:31 yx kernel: [ 1176.166226] RSP: 0018:ffffbe6f4c17bd08
>> > EFLAGS: 00000202 ORIG_RAX: 00000000ffffff13
>> > Jul  3 10:23:31 yx kernel: [ 1176.166231] RAX: 0000000000000000 RBX:
>> > 0000000000000000 RCX: 0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166234] RDX: 0000000000000000 RSI:
>> > 0000000000000000 RDI: 0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166236] RBP: ffffbe6f4c17bd08 R08:
>> > 0000000000000000 R09: 0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166239] R10: 0000000000000000 R11:
>> > 0000000000000000 R12: 0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166241] R13: 0000000000000000 R14:
>> > 0000000000000000 R15: 0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166244] FS:  00007ff04533c940(0000)
>> > GS:ffff96003ba00000(0000) knlGS:0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166247] CS:  0010 DS: 0000 ES: 0000
>> > CR0: 0000000080050033
>> > Jul  3 10:23:31 yx kernel: [ 1176.166250] CR2: 00007f799c003000 CR3:
>> > 00000000ab0fe000 CR4: 00000000001006f0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166252] Call Trace:
>> > Jul  3 10:23:31 yx kernel: [ 1176.166261]  _raw_spin_lock+0x20/0x30
>> > Jul  3 10:23:31 yx kernel: [ 1176.166270]  can_write+0x6c/0x2c0 [advcan]
>> > Jul  3 10:23:31 yx kernel: [ 1176.166276]  ? dequeue_signal+0xae/0x1a0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166281]  ? recalc_sigpending+0x1b/0x50
>> > Jul  3 10:23:31 yx kernel: [ 1176.166286]  ? __set_task_blocked+0x3c/0xa0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166292]  __vfs_write+0x3a/0x190
>> > Jul  3 10:23:31 yx kernel: [ 1176.166298]  ? apparmor_file_permission+0x1a/0x20
>> > Jul  3 10:23:31 yx kernel: [ 1176.166302]  ? security_file_permission+0x3b/0xc0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166307]  vfs_write+0xb8/0x1b0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166312]  ksys_write+0x5c/0xe0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166316]  __x64_sys_write+0x1a/0x20
>> > Jul  3 10:23:31 yx kernel: [ 1176.166321]  do_syscall_64+0x87/0x250
>> > Jul  3 10:23:31 yx kernel: [ 1176.166326]
>> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > Jul  3 10:23:31 yx kernel: [ 1176.166330] RIP: 0033:0x7ff04885d4bd
>>
>> The unwinding of the stack trace here seems not to have worked because
>> we're running instructions from _raw_spin_lock() and it might be
>> optimized or inlined.
>>
>> Note: this is a very "high level" interpretation as you can see.
>>
>> > Jul  3 10:23:31 yx kernel: [ 1176.166334] Code: bf 20 00 00 75 10 b8
>> > 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae fc ff
>> > ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 f7 fc ff
>> > ff 48 89 d0 48 83 c4 08 48 3d 01
>> > Jul  3 10:23:31 yx kernel: [ 1176.166337] RSP: 002b:00007ff04533ba70
>> > EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> > Jul  3 10:23:31 yx kernel: [ 1176.166342] RAX: ffffffffffffffda RBX:
>> > 000000000153ca00 RCX: 00007ff04885d4bd
>> > Jul  3 10:23:31 yx kernel: [ 1176.166344] RDX: 0000000000000001 RSI:
>> > 00007ff04533bac0 RDI: 000000000000000e
>> > Jul  3 10:23:31 yx kernel: [ 1176.166347] RBP: 00007ff04533bbb0 R08:
>> > 0000000000000000 R09: 0000000000000000
>> > Jul  3 10:23:31 yx kernel: [ 1176.166349] R10: 0000000000000000 R11:
>> > 0000000000000246 R12: 00007ff0491783e0
>> > Jul  3 10:23:31 yx kernel: [ 1176.166352] R13: 00007ffc5d6b9b1f R14:
>> > 00007ff04533cc00 R15: 00000000004041b0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170065] watchdog: BUG: soft lockup -
>> > CPU#3 stuck for 22s! [rt_cansend:1835]
>>
>> Now we have rt_cansend again but pid 1835 =) and in another cpu, CPU 3.
>>
>> So it *could be* that both rt_cansend tasks triggered an issue in the
>> VFS layer... but I'm not even sure which filesystem is being used and
>> IF rt_cansend is writing to the filesystem OR is just being used to
>> flush the dirty page cache for other processes.
>>
>> 2 things: Either ANOTHER running process (not shown here because it
>> is really executing), OR, worse, an already scheduled-out process (a
>> BIG kernel bug in this case, as tasks holding spinlocks can't sleep =>
>> thus can't schedule), is holding the spinlock (causing multiple calls
>> to rt_cansend to spin on CPUs forever).
>>
>> Like I said.. lots of assumptions as this is a "log-only" debugging.
>> Xenomai is a realtime kernel attempt so.. maybe the realtime part of
>> the xenomai enabled kernel triggered a real VFS dead lock (spinlock)
>> scenario =).
>>
>> > Jul  3 10:23:32 yx kernel: [ 1176.170075] Modules linked in: bnep
>> > snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>> > nls_iso8859_1 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep
>> > snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl
>> > intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp
>> > kvm_intel snd_seq punit_atom_debug crct10dif_pclmul crc32_pclmul
>> > ghash_clmulni_intel snd_seq_device cryptd intel_cstate snd_timer
>> > hci_uart snd lpc_ich advcan(OE) mei_txe btqca soundcore mei btbcm
>> > btintel bluetooth ecdh_generic rfkill_gpio pwm_lpss_platform mac_hid
>> > pwm_lpss parport_pc ppdev lp parport autofs4 i915 kvmgt vfio_mdev mdev
>> > vfio_iommu_type1 vfio kvm irqbypass drm_kms_helper syscopyarea
>> > sysfillrect sysimgblt fb_sys_fops igb drm dca ahci i2c_algo_bit
>> > libahci video i2c_hid hid
>> > Jul  3 10:23:32 yx kernel: [ 1176.170221] CPU: 3 PID: 1835 Comm:
>> > rt_cansend Tainted: G           OEL    4.19.84-bros-slove-alc-failure
>> > #1
>> > Jul  3 10:23:32 yx kernel: [ 1176.170226] I-pipe domain: Linux
>> > Jul  3 10:23:32 yx kernel: [ 1176.170236] RIP:
>> > 0010:queued_spin_lock_slowpath+0x13f/0x1a0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170242] Code: 6b ff ff ff eb e4 ba
>> > 00 01 00 00 8b 07 30 e4 09 d0 a9 00 01 ff ff 0f 85 fe fe ff ff 85 c0
>> > 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89
>> > 07 5d c3 f3 90 48 8b 32 48 85 f6
>> > Jul  3 10:23:32 yx kernel: [ 1176.170245] RSP: 0018:ffffbe6f4c16bd08
>> > EFLAGS: 00000202 ORIG_RAX: 00000000ffffff13
>> > Jul  3 10:23:32 yx kernel: [ 1176.170250] RAX: 0000000000000000 RBX:
>> > 0000000000000000 RCX: 0000000000000000
>> > Jul  3 10:23:32 yx kernel: [ 1176.170252] RDX: 0000000000000000 RSI:
>> > 0000000000000000 RDI: 0000000000000000
>> > Jul  3 10:23:32 yx kernel: [ 1176.170255] RBP: ffffbe6f4c16bd08 R08:
>> > 0000000000000000 R09: 0000000000000000
>> > Jul  3 10:23:32 yx kernel: [ 1176.170257] R10: 0000000000000000 R11:
>> > 0000000000000000 R12: 0000000000000000
>> > Jul  3 10:23:32 yx kernel: [ 1176.170259] R13: 0000000000000000 R14:
>> > 0000000000000000 R15: 0000000000000000
>> > Jul  3 10:23:32 yx kernel: [ 1176.170263] FS:  00007ff045526940(0000)
>> > GS:ffff96003bb80000(0000) knlGS:0000000000000000
>> > Jul  3 10:23:32 yx kernel: [ 1176.170266] CS:  0010 DS: 0000 ES: 0000
>> > CR0: 0000000080050033
>> > Jul  3 10:23:32 yx kernel: [ 1176.170268] CR2: 000000c8203a5000 CR3:
>> > 00000000ab0fe000 CR4: 00000000001006e0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170271] Call Trace:
>> > Jul  3 10:23:32 yx kernel: [ 1176.170282]  _raw_spin_lock+0x20/0x30
>> > Jul  3 10:23:32 yx kernel: [ 1176.170292]  can_write+0x6c/0x2c0 [advcan]
>> > Jul  3 10:23:32 yx kernel: [ 1176.170298]  ? dequeue_signal+0xae/0x1a0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170303]  ? recalc_sigpending+0x1b/0x50
>> > Jul  3 10:23:32 yx kernel: [ 1176.170308]  ? __set_task_blocked+0x3c/0xa0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170315]  __vfs_write+0x3a/0x190
>> > Jul  3 10:23:32 yx kernel: [ 1176.170321]  ? apparmor_file_permission+0x1a/0x20
>> > Jul  3 10:23:32 yx kernel: [ 1176.170325]  ? security_file_permission+0x3b/0xc0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170330]  vfs_write+0xb8/0x1b0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170335]  ksys_write+0x5c/0xe0
>> > Jul  3 10:23:32 yx kernel: [ 1176.170339]  __x64_sys_write+0x1a/0x20
>> > Jul  3 10:23:32 yx kernel: [ 1176.170345]  do_syscall_64+0x87/0x250
>> > Jul  3 10:23:32 yx kernel: [ 1176.170349]
>> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > Jul  3 10:23:32 yx kernel: [ 1176.170354] RIP: 0033:0x7ff04885d4bd
>>
>> Then we have more of the same AND... RCU code complaining that the CPU
>> 0 was stalled. Then you had a NMI (non maskable interrupt) from the
>> RCU watchdog so the kernel could print the stack trace:
>>
>> > Jul  3 10:24:08 yx kernel: [ 1212.410060] rcu: INFO: rcu_sched
>> > self-detected stall on CPU
>> > Jul  3 10:24:08 yx kernel: [ 1212.410075] rcu: 0-....: (14999 ticks
>> > this GP) idle=8de/1/0x4000000000000002 softirq=15317/15317 fqs=7496
>> > Jul  3 10:24:08 yx kernel: [ 1212.410079] rcu: (t=15000 jiffies g=17909 q=591)
>> > Jul  3 10:24:08 yx kernel: [ 1212.410087] NMI backtrace for cpu 0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410092] CPU: 0 PID: 1837 Comm:
>> > rt_cansend Tainted: G           OEL    4.19.84-bros-slove-alc-failure
>> > #1
>> > Jul  3 10:24:08 yx kernel: [ 1212.410096] I-pipe domain: Linux
>> > Jul  3 10:24:08 yx kernel: [ 1212.410099] Call Trace:
>> > Jul  3 10:24:08 yx kernel: [ 1212.410103]  <IRQ>
>> > Jul  3 10:24:08 yx kernel: [ 1212.410113]  dump_stack+0x9e/0xc8
>> > Jul  3 10:24:08 yx kernel: [ 1212.410118]  nmi_cpu_backtrace+0x95/0xa0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410123]  ? lapic_can_unplug_cpu+0xb0/0xb0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410128]
>> > nmi_trigger_cpumask_backtrace+0x90/0xd0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410132]
>> > arch_trigger_cpumask_backtrace+0x19/0x20
>> > Jul  3 10:24:08 yx kernel: [ 1212.410137]  rcu_dump_cpu_stacks+0x94/0xc8
>> > Jul  3 10:24:08 yx kernel: [ 1212.410141]  rcu_check_callbacks+0x707/0x870
>> > Jul  3 10:24:08 yx kernel: [ 1212.410148]  ? tick_sched_do_timer+0x60/0x60
>> > Jul  3 10:24:08 yx kernel: [ 1212.410152]  update_process_times+0x3d/0x70
>> > Jul  3 10:24:08 yx kernel: [ 1212.410157]  tick_sched_handle+0x29/0x60
>> > Jul  3 10:24:08 yx kernel: [ 1212.410161]  tick_sched_timer+0x3c/0x80
>> > Jul  3 10:24:08 yx kernel: [ 1212.410166]  __hrtimer_run_queues+0x106/0x270
>> > Jul  3 10:24:08 yx kernel: [ 1212.410171]  hrtimer_interrupt+0x116/0x240
>> > Jul  3 10:24:08 yx kernel: [ 1212.410177]  smp_apic_timer_interrupt+0x59/0x130
>> > Jul  3 10:24:08 yx kernel: [ 1212.410182]  ?
>> > smp_call_function_single_interrupt+0xd0/0xd0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410186]  __ipipe_do_IRQ+0x31/0x40
>> > Jul  3 10:24:08 yx kernel: [ 1212.410192]  __ipipe_do_sync_stage+0x173/0x180
>> > Jul  3 10:24:08 yx kernel: [ 1212.410196]  __ipipe_do_sync_pipeline+0x9b/0xa0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410201]  dispatch_irq_head+0xe6/0x110
>> > Jul  3 10:24:08 yx kernel: [ 1212.410205]  __ipipe_dispatch_irq+0xd9/0x1c0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410210]  __ipipe_handle_irq+0x86/0x1e0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410215]  apic_timer_interrupt+0x12/0x40
>> > Jul  3 10:24:08 yx kernel: [ 1212.410218]  </IRQ>
>> > Jul  3 10:24:08 yx kernel: [ 1212.410223] RIP:
>> > 0010:queued_spin_lock_slowpath+0xd9/0x1a0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410228] Code: 48 03 34 c5 00 67 37
>> > 91 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
>> > 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66
>> > 45 31 c0 41 39 c8 0f 84 96 00 00
>> > Jul  3 10:24:08 yx kernel: [ 1212.410231] RSP: 0018:ffffbe6f4c17bd08
>> > EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>> > Jul  3 10:24:08 yx kernel: [ 1212.410236] RAX: 0000000000040101 RBX:
>> > 0000000000000001 RCX: 0000000000040000
>> > Jul  3 10:24:08 yx kernel: [ 1212.410238] RDX: ffff96003ba350c0 RSI:
>> > 0000000000000000 RDI: ffffffffc069cc20
>> > Jul  3 10:24:08 yx kernel: [ 1212.410241] RBP: ffffbe6f4c17bd08 R08:
>> > 00007ff04533bac1 R09: ffff96001e06ac68
>> > Jul  3 10:24:08 yx kernel: [ 1212.410243] R10: ffff95ff34a27f38 R11:
>> > 0000000000000000 R12: 0000000000000000
>> > Jul  3 10:24:08 yx kernel: [ 1212.410246] R13: 00007ff04533bac0 R14:
>> > ffff95ff34a27f00 R15: 00007ff04533bac0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410255]  _raw_spin_lock+0x20/0x30
>> > Jul  3 10:24:08 yx kernel: [ 1212.410262]  can_write+0x6c/0x2c0 [advcan]
>> > Jul  3 10:24:08 yx kernel: [ 1212.410267]  ? dequeue_signal+0xae/0x1a0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410273]  ? recalc_sigpending+0x1b/0x50
>> > Jul  3 10:24:08 yx kernel: [ 1212.410277]  ? __set_task_blocked+0x3c/0xa0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410283]  __vfs_write+0x3a/0x190
>> > Jul  3 10:24:08 yx kernel: [ 1212.410289]  ? apparmor_file_permission+0x1a/0x20
>> > Jul  3 10:24:08 yx kernel: [ 1212.410293]  ? security_file_permission+0x3b/0xc0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410298]  vfs_write+0xb8/0x1b0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410302]  ksys_write+0x5c/0xe0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410307]  __x64_sys_write+0x1a/0x20
>> > Jul  3 10:24:08 yx kernel: [ 1212.410312]  do_syscall_64+0x87/0x250
>> > Jul  3 10:24:08 yx kernel: [ 1212.410317]
>> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > Jul  3 10:24:08 yx kernel: [ 1212.410321] RIP: 0033:0x7ff04885d4bd
>> > Jul  3 10:24:08 yx kernel: [ 1212.410325] Code: bf 20 00 00 75 10 b8
>> > 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae fc ff
>> > ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 f7 fc ff
>> > ff 48 89 d0 48 83 c4 08 48 3d 01
>> > Jul  3 10:24:08 yx kernel: [ 1212.410327] RSP: 002b:00007ff04533ba70
>> > EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> > Jul  3 10:24:08 yx kernel: [ 1212.410332] RAX: ffffffffffffffda RBX:
>> > 000000000153ca00 RCX: 00007ff04885d4bd
>> > Jul  3 10:24:08 yx kernel: [ 1212.410334] RDX: 0000000000000001 RSI:
>> > 00007ff04533bac0 RDI: 000000000000000e
>> > Jul  3 10:24:08 yx kernel: [ 1212.410337] RBP: 00007ff04533bbb0 R08:
>> > 0000000000000000 R09: 0000000000000000
>> > Jul  3 10:24:08 yx kernel: [ 1212.410339] R10: 0000000000000000 R11:
>> > 0000000000000246 R12: 00007ff0491783e0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410342] R13: 00007ffc5d6b9b1f R14:
>> > 00007ff04533cc00 R15: 00000000004041b0
>> > Jul  3 10:24:08 yx kernel: [ 1212.410351] Sending NMI from CPU 0 to CPUs 3:
>> > Jul  3 10:24:08 yx kernel: [ 1212.414061] NMI backtrace for cpu 3
>> > Jul  3 10:24:08 yx kernel: [ 1212.414064] CPU: 3 PID: 1835 Comm:
>> > rt_cansend Tainted: G           OEL    4.19.84-bros-slove-alc-failure
>> > #1
>> > Jul  3 10:24:08 yx kernel: [ 1212.414068] I-pipe domain: Linux
>> > Jul  3 10:24:08 yx kernel: [ 1212.414070] RIP:
>> > 0010:queued_spin_lock_slowpath+0x13f/0x1a0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414073] Code: 6b ff ff ff eb e4 ba
>> > 00 01 00 00 8b 07 30 e4 09 d0 a9 00 01 ff ff 0f 85 fe fe ff ff 85 c0
>> > 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89
>> > 07 5d c3 f3 90 48 8b 32 48 85 f6
>> > Jul  3 10:24:08 yx kernel: [ 1212.414075] RSP: 0018:ffffbe6f4c16bd08
>> > EFLAGS: 00000202
>> > Jul  3 10:24:08 yx kernel: [ 1212.414078] RAX: 0000000000040101 RBX:
>> > 0000000000000001 RCX: ffffbe6f4c16bec8
>> > Jul  3 10:24:08 yx kernel: [ 1212.414080] RDX: 0000000000000000 RSI:
>> > 0000000000000000 RDI: ffffffffc069cc20
>> > Jul  3 10:24:08 yx kernel: [ 1212.414082] RBP: ffffbe6f4c16bd08 R08:
>> > 00007ff045525581 R09: ffff96001e06ac68
>> > Jul  3 10:24:08 yx kernel: [ 1212.414084] R10: ffff95ff34a27f38 R11:
>> > 0000000000000000 R12: 0000000000000000
>> > Jul  3 10:24:08 yx kernel: [ 1212.414086] R13: 00007ff045525580 R14:
>> > ffff95ff34a27f00 R15: 00007ff045525580
>> > Jul  3 10:24:08 yx kernel: [ 1212.414088] FS:  00007ff045526940(0000)
>> > GS:ffff96003bb80000(0000) knlGS:0000000000000000
>> > Jul  3 10:24:08 yx kernel: [ 1212.414090] CS:  0010 DS: 0000 ES: 0000
>> > CR0: 0000000080050033
>> > Jul  3 10:24:08 yx kernel: [ 1212.414092] CR2: 000000c8203a5000 CR3:
>> > 00000000ab0fe000 CR4: 00000000001006e0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414093] Call Trace:
>> > Jul  3 10:24:08 yx kernel: [ 1212.414095]  _raw_spin_lock+0x20/0x30
>> > Jul  3 10:24:08 yx kernel: [ 1212.414097]  can_write+0x6c/0x2c0 [advcan]
>> > Jul  3 10:24:08 yx kernel: [ 1212.414099]  ? dequeue_signal+0xae/0x1a0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414100]  ? recalc_sigpending+0x1b/0x50
>> > Jul  3 10:24:08 yx kernel: [ 1212.414102]  ? __set_task_blocked+0x3c/0xa0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414104]  __vfs_write+0x3a/0x190
>> > Jul  3 10:24:08 yx kernel: [ 1212.414106]  ? apparmor_file_permission+0x1a/0x20
>> > Jul  3 10:24:08 yx kernel: [ 1212.414107]  ? security_file_permission+0x3b/0xc0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414109]  vfs_write+0xb8/0x1b0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414111]  ksys_write+0x5c/0xe0
>> > Jul  3 10:24:08 yx kernel: [ 1212.414113]  __x64_sys_write+0x1a/0x20
>> > Jul  3 10:24:08 yx kernel: [ 1212.414114]  do_syscall_64+0x87/0x250
>> > Jul  3 10:24:08 yx kernel: [ 1212.414116]
>> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > Jul  3 10:24:08 yx kernel: [ 1212.414118] RIP: 0033:0x7ff04885d4bd
>> > Jul  3 10:24:08 yx kernel: [ 1212.414120] Code: bf 20 00 00 75 10 b8
>> > 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae fc ff
>> > ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 f7 fc ff
>> > ff 48 89 d0 48 83 c4 08 48 3d 01
>> > Jul  3 10:24:08 yx kernel: [ 1212.414122] RSP: 002b:00007ff045525530
>> > EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> > Jul  3 10:24:08 yx kernel: [ 1212.414126] RAX: ffffffffffffffda RBX:
>> > 000000000000ffdc RCX: 00007ff04885d4bd
>> > Jul  3 10:24:08 yx kernel: [ 1212.414128] RDX: 0000000000000001 RSI:
>> > 00007ff045525580 RDI: 000000000000000e
>> > Jul  3 10:24:08 yx kernel: [ 1212.414130] RBP: 00007ff045525670 R08:
>> > 0000000000000000 R09: 0000000000000000
>> > Jul  3 10:24:08 yx kernel: [ 1212.414132] R10: 00007ff030000078 R11:
>> > 0000000000000246 R12: 00007ff045526878
>> > Jul  3 10:24:08 yx kernel: [ 1212.414134] R13: 00007ff045525be0 R14:
>> > 000000000000001c R15: 0000000000000007
>> > Jul  3 10:24:35 yx kernel: [ 1240.166063] watchdog: BUG: soft lockup -
>> > CPU#0 stuck for 22s! [rt_cansend:1837]
>>
>> and more of the same.
>>
>> Your kernel is deadlocked in the VFS layer, trying to write to the
>> filesystem from 2 PIDs:
>>
>> rt_cansend: 1835 and 1837
>>
>> and both PIDs are spinning on CPUs trying to acquire a spinlock that
>> has LIKELY been acquired by a task that has slept (been scheduled off
>> the CPU) and may never come back to the CPUs to unlock it, because
>> other tasks are spinning (and the condition to release the lock won't
>> happen).
>>
>> NOTE: This was a very fast and very inaccurate reading of your logs
>> (for other debuggers around, so be nice in critics).
>>
>> I hope I helped you somehow... talk to your kernel supporter, open a
>> support case or try other kernels to check if the issue happens. You
>> can always bisect kernel and find the offending commit IF you have a
>> 100% accurate reproducer.
>>
>> Best
>>
>> -rafaeldtinoco

More information about the ubuntu-users mailing list