ACK: [SRU][Jammy][PATCH 0/1] AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

Stefan Bader stefan.bader at canonical.com
Wed Jun 12 08:10:32 UTC 2024


On 12.06.24 03:57, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2068738
> 
> [Impact]
> 
> On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is enabled, the
> system fails to boot correctly, and all users see is a black screen.
> 
> This is caused by a null pointer dereference when enabling the IOMMU after the
> device has been initialised. It should happen the other way around.
> 
> AMD-Vi: AMD IOMMUv2 loaded and initialized
> ...
> amdgpu: Topology: Add APU node [0x15d8:0x1002]
> kfd kfd: amdgpu: added device 1002:15d8
> kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
> ...
> amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
> amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
> ...
> BUG: kernel NULL pointer dereference, address: 000000000000013c
> ...
> CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
> ...
> RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
> ...
> Call Trace:
>   <TASK>
>   ? srso_return_thunk+0x5/0x10
>   ? show_trace_log_lvl+0x28e/0x2ea
>   ? show_trace_log_lvl+0x28e/0x2ea
>   ? dm_hw_fini+0x23/0x30 [amdgpu]
>   ? show_regs.part.0+0x23/0x29
>   ? __die_body.cold+0x8/0xd
>   ? __die+0x2b/0x37
>   ? page_fault_oops+0x13b/0x170
>   ? srso_return_thunk+0x5/0x10
>   ? do_user_addr_fault+0x321/0x670
>   ? srso_return_thunk+0x5/0x10
>   ? __free_pages_ok+0x34a/0x4f0
>   ? exc_page_fault+0x77/0x170
>   ? asm_exc_page_fault+0x27/0x30
>   ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>   dm_hw_fini+0x23/0x30 [amdgpu]
>   amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
>   amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
>   amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
>   amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
>   amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
>   local_pci_probe+0x4b/0x90
>   ? srso_return_thunk+0x5/0x10
>   pci_device_probe+0x119/0x200
>   really_probe+0x222/0x420
>   __driver_probe_device+0xe8/0x140
>   driver_probe_device+0x23/0xc0
>   __driver_attach+0xf7/0x1f0
>   ? __device_attach_driver+0x140/0x140
>   bus_for_each_dev+0x7f/0xd0
>   driver_attach+0x1e/0x30
>   bus_add_driver+0x148/0x220
>   ? srso_return_thunk+0x5/0x10
>   driver_register+0x95/0x100
>   __pci_register_driver+0x68/0x70
>   amdgpu_init+0x7c/0x1000 [amdgpu]
>   ? 0xffffffffc0e0b000
>   do_one_initcall+0x49/0x1e0
>   ? srso_return_thunk+0x5/0x10
>   ? kmem_cache_alloc_trace+0x19e/0x2e0
>   do_init_module+0x52/0x260
>   load_module+0xb45/0xbe0
>   __do_sys_finit_module+0xbf/0x120
>   __x64_sys_finit_module+0x18/0x20
>   x64_sys_call+0x1ac3/0x1fa0
>   do_syscall_64+0x56/0xb0
> ...
>   entry_SYSCALL_64_after_hwframe+0x67/0xd1
> 
> A workaround does exist. Users can add "nomodeset" or "amd_iommu=off" to
> GRUB_CMDLINE_LINUX_DEFAULT, run update-grub and reboot.
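
For anyone applying the workaround by script rather than by hand, here is a
minimal sketch of the edit. It only rewrites the text of a GRUB defaults file
(on Ubuntu that is /etc/default/grub); the function name add_kernel_param is
hypothetical, and the sketch assumes the value is on a single line in double
quotes, which is the stock layout but not guaranteed.

```python
import re

# Matches a line such as: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
# Assumption: single line, double-quoted value (the stock Ubuntu layout).
_CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"[ \t]*$', re.MULTILINE
)


def add_kernel_param(grub_text: str, param: str) -> str:
    """Return grub_text with param added to GRUB_CMDLINE_LINUX_DEFAULT.

    The parameter is appended only if it is not already present, so the
    function can be applied repeatedly without duplicating it.
    """

    def _rewrite(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + '"' + " ".join(params) + '"'

    new_text, count = _CMDLINE_RE.subn(_rewrite, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "amd_iommu=off"), end="")
```

The sketch deliberately does not write the file or run update-grub; those
steps need root and are the same as in the quoted text.
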
> 
> [Fix]
> 
> The regression was caused by the following commit that landed in
> 5.15.0-112-generic, and 5.15.150 upstream:
> 
> commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
> Author: Yifan Zhang <yifan1.zhang at amd.com>
> Date: Tue Sep 28 15:42:35 2021 +0800
> Subject: drm/amdgpu: init iommu after amdkfd device init
> Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd
> 
> The fix is to revert this patch, as it was not supposed to be backported to 5.15
> stable.
> 
> The mailing list discussion with AMD developers is:
> 
> https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/
> 
> The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so sending as an
> Ubuntu SAUCE patch. If the upstream status changes, we can NAK and resend.
> 
> [Testcase]
> 
> You need a system with an AMD Picasso/Raven 2 device. It will likely be an APU,
> and not a discrete graphics card, but any AMD Picasso/Raven 2 device is
> affected.
> 
> Install the kernel and boot. Make sure full modesetting is enabled.
> 
> There is a test kernel available in the ppa below:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test
> 
> If you install the test kernel, your system should boot successfully.
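
When verifying, a successful boot only counts if neither workaround is still
on the kernel command line. A rough sketch of that check follows, assuming the
failure shows the same log lines as in the [Impact] section. The function
names are hypothetical; the inputs are the contents of /proc/cmdline and the
kernel log text (for example the output of dmesg).

```python
# Substrings taken from the log excerpt in the [Impact] section.
FAILURE_SIGNATURES = (
    "Failed to resume IOMMU for device",
    "amdgpu_device_ip_init failed",
    "Fatal error during GPU init",
    "kernel NULL pointer dereference",
)

# Parameters that hide the bug and would invalidate the test.
WORKAROUND_PARAMS = ("nomodeset", "amd_iommu=off")


def active_workarounds(cmdline: str) -> list:
    """Return the workaround parameters present on the kernel command line."""
    tokens = cmdline.split()
    return [param for param in WORKAROUND_PARAMS if param in tokens]


def failure_lines(kernel_log: str) -> list:
    """Return the kernel log lines that match a known failure signature."""
    return [
        line
        for line in kernel_log.splitlines()
        if any(signature in line for signature in FAILURE_SIGNATURES)
    ]


def boot_looks_fixed(cmdline: str, kernel_log: str) -> bool:
    """True if no workaround is active and no failure signature was logged."""
    return not active_workarounds(cmdline) and not failure_lines(kernel_log)
```

This checks only for the messages quoted in this thread, so a clean result is
a hint rather than proof that amdgpu initialised correctly.
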
> 
> [Where problems could occur]
> 
> We are reverting a problematic patch and going back to how it was before
> 5.15.0-112-generic. This should not cause any issues for users.
> 
> If a regression were to occur, users can add "nomodeset" or "amd_iommu=off" to
> GRUB_CMDLINE_LINUX_DEFAULT and reboot, or pin their kernel to a working one.
> 
> The impact of a regression would be high, as users' displays could be blank.
> 
> [Other Info]
> 
> User reports:
> https://forums.linuxmint.com/viewtopic.php?t=421484
> https://forums.linuxmint.com/viewtopic.php?t=421441
> https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
> https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
> https://bugs.launchpad.net/bugs/2068812
> 
> As bizarre as it is, this commit was actually originally included in 5.15-rc5:
> 
> commit 714d9e4574d54596973ee3b0624ee4a16264d700
> Author: Yifan Zhang <yifan1.zhang at amd.com>
> Date: Tue Sep 28 15:42:35 2021 +0800
> Subject: drm/amdgpu: init iommu after amdkfd device init
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700
> 
> It seems to have caused issues back then too, and was removed in the following
> fixups, in 5.16-rc1:
> 
> commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
> Author: James Zhu <James.Zhu at amd.com>
> Date:   Tue Nov 2 21:33:50 2021 -0400
> Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c
>      
> commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
> Author: shaoyunl <shaoyun.liu at amd.com>
> Date:   Fri Nov 5 12:34:14 2021 -0400
> Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
> 
> I'm not exactly in favor of rewriting history twice, so I think we should just
> revert the upstream stable patch and move on.
> 
> Armin Wolf (1):
>    UBUNTU: SAUCE: Revert "drm/amdgpu: init iommu after amdkfd device
>      init"
> 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 

Acked-by: Stefan Bader <stefan.bader at canonical.com>