PROBLEM: UBSAN enabled in 5.14 and 5.13.14 kernels leads to kernel crash
Andrew Moes
mail at andrii.me
Sat Sep 4 20:20:31 UTC 2021
Hi,
Problematic change: https://lists.ubuntu.com/archives/kernel-team/2021-August/123425.html
Affected kernels: 5.14, 5.14.1 and 5.13.14
Last working kernel: 5.13.13
Problem statement:
A server with NVDIMMs installed fails to boot and throws many misleading errors.
I spent a significant amount of time troubleshooting why our edge server won't boot with the 5.14 kernel. The boot produced various udev failures, ACPI errors and failures to communicate with IPMI before coming to a full stop. After disabling most of the kernel modules that were being loaded when the kernel became tainted, I finally isolated the problem to the "nfit" module. I then had to pull the server off the rack and physically remove the NVDIMMs to make it boot again.
While going through the upstream commits and config option changes between 5.13 and 5.14, I decided to install 5.13.14. Surprisingly, it crashed in the same way, which let me narrow the problem down to UBSAN being enabled. I rebuilt all the problematic kernels with UBSAN off, and that solved the issue immediately.
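For anyone comparing builds, here is a minimal sketch of how the UBSAN settings of two kernel configs could be listed side by side. It assumes the usual .config text format ("CONFIG_X=y" or "# CONFIG_X is not set"). The three option names are real symbols from lib/Kconfig.ubsan, but picking these three is my assumption, and the sample fragment is hypothetical, not taken from the affected build.

```python
import re

# Options to report. The names are real Kconfig symbols; choosing
# exactly these three is an assumption made for illustration.
UBSAN_OPTIONS = ("CONFIG_UBSAN", "CONFIG_UBSAN_TRAP", "CONFIG_UBSAN_BOUNDS")

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def ubsan_settings(config_text):
    """Return {option: value} for the UBSAN options found in a .config text."""
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match and match.group(1) in UBSAN_OPTIONS:
            found[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match and match.group(1) in UBSAN_OPTIONS:
            # Explicitly disabled options are reported as "n".
            found[match.group(1)] = "n"
    return found


# Hypothetical config fragment, NOT copied from the affected kernels.
sample = """\
CONFIG_UBSAN=y
CONFIG_UBSAN_BOUNDS=y
# CONFIG_UBSAN_TRAP is not set
CONFIG_KASAN=y
"""

print(ubsan_settings(sample))
```

On many distributions the config of an installed kernel is available as a file under /boot, so the same function could be run on the text of the 5.13.13 and 5.13.14 configs and the two results compared.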
Available workarounds:
1. Remove all Intel Optane persistent memory modules (PMEM, NVDIMM).
2. Rebuild the kernel with UBSAN disabled.
Proposed action:
Disable UBSAN. As per https://github.com/torvalds/linux/blob/master/lib/Kconfig.ubsan#L23, having it enabled in this configuration may lead to instability on systems that never had issues before.
Such issues are hard to troubleshoot, and keeping it enabled makes running servers less predictable, potentially creating far more problems for engineers and services than it solves.
Slice of kernel logs:
Sep 04 14:12:33 pve-bfs-1 kernel: IPMI message handler: version 39.2
Sep 04 14:12:33 pve-bfs-1 kernel: ipmi device interface
Sep 04 14:12:33 pve-bfs-1 kernel: invalid opcode: 0000 [#1] SMP NOPTI
Sep 04 14:12:33 pve-bfs-1 kernel: CPU: 18 PID: 757 Comm: systemd-udevd Tainted: P O 5.14.1-1-edge #1
Sep 04 14:12:33 pve-bfs-1 kernel: Hardware name: GIGABYTE E251-U70-00/MU71-SU0-00, BIOS R07 09/15/2020
Sep 04 14:12:33 pve-bfs-1 kernel: RIP: 0010:acpi_ds_exec_end_op+0x184/0x77c
Sep 04 14:12:33 pve-bfs-1 kernel: Code: 77 28 48 8b 04 c5 a0 b3 0a 91 48 89 df ff d0 0f 1f 00 41 89 c6 e9 97 00 00 00 0f b6 43 0d 8d 50 ff 48 63 d2 48 83 fa 09 76 02 <0f> 0b 83 c0 6c 0f b7 7b 0a 48 89 da 44 89 45 d4 48 98 48 8d 34 c3
Sep 04 14:12:33 pve-bfs-1 kernel: RSP: 0018:ffffb0f402113658 EFLAGS: 00010286
Sep 04 14:12:33 pve-bfs-1 kernel: RAX: 0000000000000000 RBX: ffff9d80de53c000 RCX: 0000000000000040
Sep 04 14:12:33 pve-bfs-1 kernel: RDX: ffffffffffffffff RSI: ffffffff910ab220 RDI: 00000000000002cb
Sep 04 14:12:33 pve-bfs-1 kernel: RBP: ffffb0f402113688 R08: 0000000000000000 R09: ffff9d80e3902900
Sep 04 14:12:33 pve-bfs-1 kernel: R10: ffff9d80edbca5a0 R11: ffff9d80de53c038 R12: ffff9d80de53c000
Sep 04 14:12:33 pve-bfs-1 kernel: R13: ffff9d80e39029b0 R14: 0000000000000000 R15: 0000000000000000
Sep 04 14:12:33 pve-bfs-1 kernel: FS: 00007f3591bac8c0(0000) GS:ffff9d97a0e80000(0000) knlGS:0000000000000000
Sep 04 14:12:33 pve-bfs-1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 04 14:12:33 pve-bfs-1 kernel: CR2: 00007f35910fd412 CR3: 0000000118e40002 CR4: 00000000007706e0
Sep 04 14:12:33 pve-bfs-1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 04 14:12:33 pve-bfs-1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 04 14:12:33 pve-bfs-1 kernel: PKRU: 55555554
Sep 04 14:12:33 pve-bfs-1 kernel: Call Trace:
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_ps_parse_loop+0x845/0x921
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_ps_parse_aml+0x1af/0x550
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_ps_execute_method+0x208/0x2ca
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_ns_evaluate+0x34e/0x4f0
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_evaluate_object+0x18e/0x3b4
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_evaluate_dsm+0xb3/0x120
Sep 04 14:12:33 pve-bfs-1 kernel: ? acpi_evaluate_dsm+0xb3/0x120
Sep 04 14:12:33 pve-bfs-1 kernel: nfit_intel_shutdown_status+0xed/0x1b0 [nfit]
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_nfit_init+0x150c/0x1f70 [nfit]
Sep 04 14:12:33 pve-bfs-1 kernel: ? kfree+0xba/0x3b0
Sep 04 14:12:33 pve-bfs-1 kernel: ? acpi_ns_get_node+0xaa/0xb8
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_nfit_add+0x18d/0x1f0 [nfit]
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_device_probe+0x49/0x170
Sep 04 14:12:33 pve-bfs-1 kernel: really_probe+0x1fb/0x400
Sep 04 14:12:33 pve-bfs-1 kernel: __driver_probe_device+0x109/0x180
Sep 04 14:12:33 pve-bfs-1 kernel: driver_probe_device+0x23/0x90
Sep 04 14:12:33 pve-bfs-1 kernel: __driver_attach+0xac/0x1b0
Sep 04 14:12:33 pve-bfs-1 kernel: ? __device_attach_driver+0xe0/0xe0
Sep 04 14:12:33 pve-bfs-1 kernel: bus_for_each_dev+0x7c/0xc0
Sep 04 14:12:33 pve-bfs-1 kernel: driver_attach+0x1e/0x20
Sep 04 14:12:33 pve-bfs-1 kernel: bus_add_driver+0x135/0x1f0
Sep 04 14:12:33 pve-bfs-1 kernel: driver_register+0x91/0xf0
Sep 04 14:12:33 pve-bfs-1 kernel: acpi_bus_register_driver+0x39/0x50
Sep 04 14:12:33 pve-bfs-1 kernel: nfit_init+0x168/0x1000 [nfit]
Sep 04 14:12:33 pve-bfs-1 kernel: ? 0xffffffffc0604000
Sep 04 14:12:33 pve-bfs-1 kernel: do_one_initcall+0x46/0x1d0
Sep 04 14:12:33 pve-bfs-1 kernel: ? kmem_cache_alloc_trace+0x159/0x2c0
Sep 04 14:12:33 pve-bfs-1 kernel: do_init_module+0x62/0x280
Sep 04 14:12:33 pve-bfs-1 kernel: load_module+0x24ba/0x2730
Sep 04 14:12:33 pve-bfs-1 kernel: __do_sys_finit_module+0xbf/0x120
Sep 04 14:12:33 pve-bfs-1 kernel: __x64_sys_finit_module+0x1a/0x20
Sep 04 14:12:33 pve-bfs-1 kernel: do_syscall_64+0x59/0xc0
Sep 04 14:12:33 pve-bfs-1 kernel: ? ksys_mmap_pgoff+0x148/0x280
Sep 04 14:12:33 pve-bfs-1 kernel: ? exit_to_user_mode_prepare+0x37/0x1b0
Sep 04 14:12:33 pve-bfs-1 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Sep 04 14:12:33 pve-bfs-1 systemd[1]: Finished Coldplug All udev Devices.
Sep 04 14:12:33 pve-bfs-1 systemd-udevd[808]: Using default interface naming scheme 'v247'.
Sep 04 14:12:33 pve-bfs-1 systemd-udevd[801]: Using default interface naming scheme 'v247'.
Sep 04 14:12:33 pve-bfs-1 systemd-udevd[801]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 04 14:12:33 pve-bfs-1 systemd-udevd[808]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 04 14:12:33 pve-bfs-1 kernel: ? __x64_sys_mmap+0x33/0x40
Sep 04 14:12:33 pve-bfs-1 kernel: ? do_syscall_64+0x69/0xc0
Sep 04 14:12:33 pve-bfs-1 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Sep 04 14:12:33 pve-bfs-1 kernel: RIP: 0033:0x7f35920659b9
Sep 04 14:12:33 pve-bfs-1 kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
Sep 04 14:12:33 pve-bfs-1 kernel: RSP: 002b:00007ffc8ed7ddd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Sep 04 14:12:33 pve-bfs-1 kernel: RAX: ffffffffffffffda RBX: 0000563dc7a6b930 RCX: 00007f35920659b9
Sep 04 14:12:33 pve-bfs-1 kernel: RDX: 0000000000000000 RSI: 00007f35921f0e2d RDI: 0000000000000006
Sep 04 14:12:33 pve-bfs-1 kernel: RBP: 0000000000020000 R08: 0000000000000000 R09: 0000563dc7572841
Sep 04 14:12:33 pve-bfs-1 kernel: ZFS: Loaded module v2.1.0-1, ZFS pool version 5000, ZFS filesystem version 5
Sep 04 14:12:33 pve-bfs-1 kernel: R10: 0000000000000006 R11: 0000000000000246 R12: 00007f35921f0e2d
Sep 04 14:12:33 pve-bfs-1 kernel: R13: 0000000000000000 R14: 0000563dc7a699e0 R15: 0000563dc7a6b930
Sep 04 14:12:33 pve-bfs-1 kernel: Modules linked in: ipmi_devintf ipmi_msghandler fjes(+) nfit(+) acpi_power_meter(+) acpi_pad mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq hid_generic usbkbd usbmouse usbhid hid mlx5_ib ib_uverbs ib_core uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32_pclmul mlx5_core mlxfw psample tls pci_hyperv_intf igb xhci_pci i2c_i801 xhci_pci_renesas i2c_algo_bit i2c_smbus lpc_ich dca ahci xhci_hcd libahci wmi
Sep 04 14:12:33 pve-bfs-1 kernel: ---[ end trace 6bd310ebdb659178 ]---
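A note on reading the first "Code:" line in the log: the byte in angle brackets marks the faulting instruction pointer, and here it is 0f followed by 0b, which is the x86 encoding of UD2, an instruction that deliberately raises an invalid-opcode exception. That matches the "invalid opcode: 0000" line. The sketch below only extracts and checks those bytes. It does not prove which check emitted the trap, and the helper names are mine.

```python
# "Code:" line copied from the kernel-side register dump in the log above.
CODE_LINE = (
    "Code: 77 28 48 8b 04 c5 a0 b3 0a 91 48 89 df ff d0 0f 1f 00 41 89 c6 "
    "e9 97 00 00 00 0f b6 43 0d 8d 50 ff 48 63 d2 48 83 fa 09 76 02 <0f> 0b "
    "83 c0 6c 0f b7 7b 0a 48 89 da 44 89 45 d4 48 98 48 8d 34 c3"
)


def faulting_bytes(line, count=2):
    """Return `count` bytes starting at the <xx> marker of an oops Code: line."""
    # Everything after "Code:" is a list of hex bytes; any log prefix is ignored.
    _, _, tail = line.partition("Code:")
    tokens = tail.split()
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return [token[1:-1]] + tokens[index + 1:index + count]
    return []  # no marker found


def is_ud2(line):
    """True if the faulting instruction starts with 0f 0b (x86 UD2)."""
    return faulting_bytes(line) == ["0f", "0b"]


print(faulting_bytes(CODE_LINE), is_ud2(CODE_LINE))
```

The same helper returns False for the second, user-space "Code:" line in the log, where the marker sits on the byte 48 right after the syscall instruction.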
Thank you.
--Andrew