[Bug 2007993] Re: null pointer dereference in hsa_init

Steve Langasek 2007993 at bugs.launchpad.net
Fri May 19 23:50:17 UTC 2023


Hello Cory, or anyone else affected,

Accepted rocr-runtime into kinetic-proposed. The package will build now
and be available at
https://launchpad.net/ubuntu/+source/rocr-runtime/5.1.0-2ubuntu0.1 in a
few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us in getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug
mentioning the version of the package you tested and what testing has
been performed on the package, and change the tag from
verification-needed-kinetic to verification-done-kinetic. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-kinetic. In either case, without details of
your testing we will not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: rocr-runtime (Ubuntu Kinetic)
       Status: In Progress => Fix Committed

** Tags added: verification-needed-kinetic

-- 
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/2007993

Title:
  null pointer dereference in hsa_init

Status in rocr-runtime package in Ubuntu:
  Fix Released
Status in rocr-runtime source package in Jammy:
  Fix Committed
Status in rocr-runtime source package in Kinetic:
  Fix Committed
Status in rocr-runtime package in Debian:
  Fix Released

Bug description:
  [ Impact ]

  The rocr-runtime provides the basic interface between compute code
  written to run on AMD GPUs and the AMDGPU/AMDKFD driver within the
  kernel. On Ubuntu 22.04, the library crashes with a segfault during
  initialization. This bug makes the library unusable.

  On Ubuntu 22.04, the main use for this library is in rocminfo, which
  provides AMD GPU users with a description of the compute capabilities
  of their hardware. For example, rocminfo provides the name of the ISA
  for the hardware, which is useful for choosing compiler flags when
  building GPU libraries from source. Invoking rocminfo is also an easy
  way for novice users to find information about their hardware (e.g.,
  for inclusion in bug reports filed against GPU libraries). It would
  therefore be useful if this fix could be backported to Ubuntu 22.04.

  The fix changes the order of initialization of a pair of static
  variables in the rocr-runtime by moving them into the same translation
  unit, thereby ensuring the order is both deterministic and correct.

  [ Test Plan ]

  To reproduce this bug, you will need an AMD GPU installed on the
  machine. Then the following terminal commands should be sufficient to
  cause a segfault originating in the rocr-runtime:

      apt install rocminfo kmod
      rocminfo

  Once the bug is fixed, you should see detailed information about your
  installed GPU hardware printed to standard output. This bug is
  deterministic at runtime, so it is relatively easy to verify if you
  have the necessary hardware.

  On Ubuntu 22.04, the rocminfo utility is the only package that depends
  on rocr-runtime, so this simple test is fairly comprehensive.

  [ Where problems could occur ]

  The rocr-runtime package is already badly broken, so the risk
  associated with backporting a fix is low. If a mistake were made in
  fixing this bug, the most likely outcome would be that the package
  remains broken.

  [ Other info ]

  The same fix is in use on Debian Unstable, Ubuntu 23.04 and upstream,
  so it is already being used in other environments (albeit with
  different versions of rocr-runtime).

  [ Original bug report ]
   
  # System Information:
  Description:	Ubuntu 22.04.2 LTS
  Release:	22.04

  # Package Version:
  libhsa-runtime64-1:
    Installed: 5.0.0-1
    Source: rocr-runtime

  # What was done:

      # on Ubuntu 22.04 or 22.10 with an AMD GPU installed
      apt install rocminfo kmod
      rocminfo

  # What was seen:

      ROCk module is loaded
      Segmentation fault (core dumped)

  Note that the rocminfo utility will not try to initialize
  libhsa-runtime64 unless you have an AMD GPU installed, which is
  required to reproduce this problem.

  After some debugging, I came to the conclusion that this is a null
  pointer dereference in libhsa-runtime64. The order of static
  initialization is different when building the rocr-runtime package on
  Ubuntu as compared to on Debian, and this results in the package
  working on Debian but crashing when it's rebuilt for Ubuntu. A couple
  of static variables are being copied before they are initialized,
  leading to a null pointer dereference later on in the program.

  # What was expected:
  rocminfo should not crash

  # Debian Bug:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1031089

  # Debian Patch:
  https://salsa.debian.org/rocm-team/rocr-runtime/-/blob/debian/5.2.3-3/debian/patches/0003-fix-static-initialization-order.patch

  The patch applied to the Debian package has fixed this bug in Ubuntu
  23.04. It would be great if the fix could also be applied to Ubuntu
  22.04 LTS. There's not a lot of ROCm functionality in Jammy, but
  fixing this bug would at least get the basics like rocminfo working.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/rocr-runtime/+bug/2007993/+subscriptions



