[Bug 2007993] Re: null pointer dereference in hsa_init
Steve Langasek
2007993 at bugs.launchpad.net
Fri May 19 23:50:17 UTC 2023
Hello Cory, or anyone else affected,
Accepted rocr-runtime into kinetic-proposed. The package will build now
and be available at
https://launchpad.net/ubuntu/+source/rocr-runtime/5.1.0-2ubuntu0.1
in a few hours, and then in the -proposed repository.
Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will help us get this
update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug
mentioning the version of the package you tested and what testing has
been performed on the package, and change the tag from
verification-needed-kinetic to verification-done-kinetic. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-kinetic. In either case, without details of
your testing we will not be able to proceed.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance for helping!
N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.
** Changed in: rocr-runtime (Ubuntu Kinetic)
Status: In Progress => Fix Committed
** Tags added: verification-needed-kinetic
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/2007993
Title:
null pointer dereference in hsa_init
Status in rocr-runtime package in Ubuntu:
Fix Released
Status in rocr-runtime source package in Jammy:
Fix Committed
Status in rocr-runtime source package in Kinetic:
Fix Committed
Status in rocr-runtime package in Debian:
Fix Released
Bug description:
[ Impact ]
The rocr-runtime provides the basic interface between compute code
written to run on AMD GPUs and the AMDGPU/AMDKFD driver within the
kernel. On Ubuntu 22.04, the library crashes with a segfault during
initialization. This bug makes the library unusable.
On Ubuntu 22.04, the main use for this library is in rocminfo, which
provides AMD GPU users with a description of the compute capabilities
of their hardware. For example, rocminfo provides the name of the ISA
for the hardware, which is useful for choosing compiler flags when
building GPU libraries from source. Invoking rocminfo is also an easy
way for novice users to find information about their hardware (e.g.,
for inclusion in bug reports filed against GPU libraries). It would
therefore be useful if this fix could be backported to Ubuntu 22.04.
The fix changes the order of initialization of a pair of static
variables in the rocr-runtime by moving them into the same translation
unit, thereby ensuring the order is both deterministic and correct.
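
A minimal sketch of why this helps, assuming two hypothetical objects
(Allocator and Runtime are illustrative names, not the actual
rocr-runtime symbols): C++ initializes ordinary non-template variables
with static storage duration in definition order within one translation
unit, whereas the relative order across translation units is left
unspecified.

```cpp
#include <string>
#include <vector>

// Hypothetical names for illustration; not the real rocr-runtime code.
// Records the order in which the global objects are constructed.
std::vector<std::string>& init_log() {
    static std::vector<std::string> log;  // function-local: built on first use
    return log;
}

struct Allocator {
    Allocator() { init_log().push_back("allocator"); }
    int id = 7;
};

struct Runtime {
    explicit Runtime(const Allocator* a) : alloc(a) {
        init_log().push_back("runtime");
    }
    const Allocator* alloc;
};

// Both definitions live in the same translation unit, so g_allocator is
// guaranteed to be initialized before g_runtime (definition order).
Allocator g_allocator;
Runtime g_runtime(&g_allocator);
```

If the two definitions sat in separate source files, the order would
depend on the toolchain and link order, which would be consistent with
the package behaving differently when built on Debian and on Ubuntu.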
[ Test Plan ]
To reproduce this bug, you will need an AMD GPU installed on the
machine. Then the following terminal commands should be sufficient to
cause a segfault originating in the rocr-runtime:
apt install rocminfo kmod
rocminfo
Once the bug is fixed, you should see detailed information about your
installed GPU hardware printed to standard output. This bug is
deterministic at runtime, so it is relatively easy to verify if you
have the necessary hardware.
On Ubuntu 22.04, the rocminfo utility is the only package that depends
on rocr-runtime, so this simple test is fairly comprehensive.
[ Where problems could occur ]
The rocr-runtime package is already badly broken, so the risk
associated with backporting a fix is low. If a mistake were made in
fixing this bug, the most likely outcome would be that the package
remains broken.
[ Other info ]
The same fix is in use on Debian Unstable, Ubuntu 23.04 and upstream,
so it is already being used in other environments (albeit with
different versions of rocr-runtime).
[ Original bug report ]
# System Information:
Description: Ubuntu 22.04.2 LTS
Release: 22.04
# Package Version:
libhsa-runtime64-1:
Installed: 5.0.0-1
Source: rocr-runtime
# What was done:
# on Ubuntu 22.04 or 22.10 with an AMD GPU installed
apt install rocminfo kmod
rocminfo
# What was seen:
ROCk module is loaded
Segmentation fault (core dumped)
Note that the rocminfo utility will not try to initialize
libhsa-runtime64 unless an AMD GPU is installed, so an AMD GPU is
required to reproduce this problem.
After some debugging, I came to the conclusion that this is a null
pointer dereference in libhsa-runtime64. The order of static
initialization is different when building the rocr-runtime package on
Ubuntu as compared to on Debian, and this results in the package
working on Debian but crashing when it's rebuilt for Ubuntu. A couple
of static variables are being copied before they are initialized,
leading to a null pointer dereference later on in the program.
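
The failure mode can be imitated in a single file. This is only a
sketch under assumptions: Table, make_table, g_table and g_early_copy
are hypothetical names, and the real bug involved variables in
different translation units. A copy taken before its source has been
dynamically initialized sees the zero-initialized value, which for a
pointer is null. The standard allows a compiler to initialize the
source early, so the null result is what common toolchains produce
rather than a guarantee.

```cpp
#include <cstdio>

// Hypothetical stand-ins; not the real rocr-runtime symbols.
struct Table { int entries = 3; };

// The side effect keeps this a run-time (dynamic) initializer.
Table* make_table() {
    static Table table;
    std::puts("table created");
    return &table;
}

extern Table* g_table;  // declared here, defined further down

// Dynamic initialization runs in definition order, so this copy is taken
// while g_table still holds its zero-initialized value (nullptr).
Table* g_early_copy = g_table;

Table* g_table = make_table();

bool early_copy_is_null() { return g_early_copy == nullptr; }

// Guarded access: dereferencing the stale copy directly would crash.
int entries_or(const Table* t, int fallback) {
    return t ? t->entries : fallback;
}
```

Code that later dereferences g_early_copy without a check would
segfault, even though g_table itself ends up valid.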
# What was expected:
rocminfo should not crash
# Debian Bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1031089
# Debian Patch:
https://salsa.debian.org/rocm-team/rocr-runtime/-/blob/debian/5.2.3-3/debian/patches/0003-fix-static-initialization-order.patch
The patch applied to the Debian package has fixed this bug in Ubuntu
23.04. It would be great if the fix could also be applied to Ubuntu
22.04 LTS. There's not a lot of ROCm functionality in Jammy, but
fixing this bug would at least get the basics like rocminfo working.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/rocr-runtime/+bug/2007993/+subscriptions