[Bug 2007993] Re: null pointer dereference in hsa_init
Erich Eickmeyer
2007993 at bugs.launchpad.net
Fri May 19 22:20:16 UTC 2023
** Changed in: rocr-runtime (Ubuntu Kinetic)
Status: New => In Progress
** Changed in: rocr-runtime (Ubuntu Kinetic)
Assignee: (unassigned) => Erich Eickmeyer (eeickmeyer)
--
https://bugs.launchpad.net/bugs/2007993
Title:
null pointer dereference in hsa_init
Status in rocr-runtime package in Ubuntu:
Fix Released
Status in rocr-runtime source package in Jammy:
Fix Committed
Status in rocr-runtime source package in Kinetic:
In Progress
Status in rocr-runtime package in Debian:
Fix Released
Bug description:
[ Impact ]
The rocr-runtime provides the basic interface between compute code
written to run on AMD GPUs and the AMDGPU/AMDKFD driver within the
kernel. On Ubuntu 22.04, the library crashes with a segfault during
initialization. This bug makes the library unusable.
On Ubuntu 22.04, the main use for this library is in rocminfo, which
provides AMD GPU users with a description of the compute capabilities
of their hardware. For example, rocminfo provides the name of the ISA
for the hardware, which is useful for choosing compiler flags when
building GPU libraries from source. Invoking rocminfo is also an easy
way for novice users to find information about their hardware (e.g.,
for inclusion in bug reports filed against GPU libraries). It would
therefore be useful if this fix could be backported to Ubuntu 22.04.
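The fix changes the order of initialization of a pair of static
variables in the rocr-runtime by moving them into the same translation
unit. Within a single translation unit, such variables are initialized
in the order in which they are defined. Across translation units, the
relative order is unspecified. Moving them together therefore makes the
order both deterministic and correct.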
[ Test Plan ]
To reproduce this bug, you will need an AMD GPU installed on the
machine. Then the following terminal commands should be sufficient to
cause a segfault originating in the rocr-runtime:
apt install rocminfo kmod
rocminfo
Once the bug is fixed, you should see detailed information about your
installed GPU hardware printed to standard output. This bug is
deterministic at runtime, so it is relatively easy to verify if you
have the necessary hardware.
On Ubuntu 22.04, the rocminfo utility is the only package that depends
on rocr-runtime, so this simple test is fairly comprehensive.
[ Where problems could occur ]
The rocr-runtime package is already badly broken, so the risk
associated with backporting a fix is low. If a mistake were made in
fixing this bug, the most likely outcome would be that the package
remains broken.
[ Other info ]
The same fix is in use on Debian Unstable, Ubuntu 23.04 and upstream,
so it is already being used in other environments (albeit with
different versions of rocr-runtime).
[ Original bug report ]
# System Information:
Description: Ubuntu 22.04.2 LTS
Release: 22.04
# Package Version:
libhsa-runtime64-1:
Installed: 5.0.0-1
Source: rocr-runtime
# What was done:
# on Ubuntu 22.04 or 22.10 with an AMD GPU installed
apt install rocminfo kmod
rocminfo
# What was seen:
ROCk module is loaded
Segmentation fault (core dumped)
Note that the rocminfo utility will not try to initialize
libhsa-runtime64 unless you have an AMD GPU installed, which is
required to reproduce this problem.
After some debugging, I came to the conclusion that this is a null
pointer dereference in libhsa-runtime64. The order of static
initialization differs between builds of the rocr-runtime package on
Ubuntu and on Debian. As a result, the package works on Debian but
crashes when it is rebuilt for Ubuntu. A couple of static variables
are copied before they are initialized, which leads to a null pointer
dereference later in the program.
# What was expected:
rocminfo should not crash
# Debian Bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1031089
# Debian Patch:
https://salsa.debian.org/rocm-team/rocr-runtime/-/blob/debian/5.2.3-3/debian/patches/0003-fix-static-initialization-order.patch
The patch applied to the Debian package has fixed this bug in Ubuntu
23.04. It would be great if the fix could also be applied to Ubuntu
22.04 LTS. There's not a lot of ROCm functionality in Jammy, but
fixing this bug would at least get the basics like rocminfo working.