[Bug 1826811] Re: Valgrind unhandled instruction 0xD5380000 on Aarch64

Mon Dec 16 12:04:48 UTC 2019

I tested on an ARM machine with the proposed package and it works fine.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to valgrind in Ubuntu.
https://bugs.launchpad.net/bugs/1826811

Title:
  Valgrind unhandled instruction 0xD5380000 on Aarch64

Status in valgrind package in Ubuntu:
  Fix Released
Status in valgrind source package in Bionic:
  Fix Committed
Status in valgrind package in Fedora:
  Fix Released

Bug description:
  [Impact]
  valgrind on bionic dumps core and errors out as follows:

  ARM64 front end: branch_etc
  disInstr(arm64): unhandled instruction 0xD5380000
  disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
  ==11950== valgrind: Unrecognised instruction at address 0x4014c90.
  ==11950==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==11950==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==11950==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==11950==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==11950==    by 0x4001B47: _dl_start (rtld.c:523)
  ==11950==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)
  ==11950== Your program just tried to execute an instruction that Valgrind
  ==11950== did not recognise.  There are two possible reasons for this.
  ==11950== 1. Your program has a bug and erroneously jumped to a non-code
  ==11950==    location.  If you are running Memcheck and you just saw a
  ==11950==    warning about a bad jump, it's probably your program's fault.
  ==11950== 2. The instruction is legitimate but Valgrind doesn't handle it,
  ==11950==    i.e. it's Valgrind's fault.  If you think this is the case or
  ==11950==    you are not sure, please let us know and we'll try to fix it.
  ==11950== Either way, Valgrind will now raise a SIGILL signal which will
  ==11950== probably kill your program.
  ==11950==
  ==11950== Process terminating with default action of signal 4 (SIGILL)
  ==11950==  Illegal opcode at address 0x4014C90
  ==11950==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==11950==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==11950==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==11950==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==11950==    by 0x4001B47: _dl_start (rtld.c:523)
  ==11950==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)

  The crash occurs because Valgrind simulates the CPU when it runs a
  process: it disassembles every instruction the process executes and
  inserts its instrumentation at run time. In this case, Valgrind does
  not recognise the MIDR_EL1 system register used by the "mrs %0,
  midr_el1" instruction, which reads the Main ID Register into the
  variable bound to %0 (id):

  asm volatile ("mrs %0, midr_el1" : "=r"(id));

  Valgrind cannot decode this instruction and therefore crashes.
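
  As a rough illustration of why 0xD5380000 is a read of MIDR_EL1, the
  sketch below splits the word into the fields of the A64 MRS encoding.
  The struct and function names (mrs_fields, decode_mrs,
  is_midr_el1_read) are made up for this example and are not taken from
  Valgrind or glibc.

  ```c
  #include <stdbool.h>
  #include <stdint.h>

  /* Fields of an A64 MRS (system register read) instruction. */
  struct mrs_fields {
      unsigned op0, op1, crn, crm, op2, rt;
  };

  /* Return true and fill *f if insn is an MRS encoding.
     Bits [31:20] are 0xD53; bit 19 is o0, and op0 = 2 + o0. */
  static bool decode_mrs(uint32_t insn, struct mrs_fields *f)
  {
      if ((insn >> 20) != 0xD53u)
          return false;
      f->op0 = 2u + ((insn >> 19) & 0x1u);
      f->op1 = (insn >> 16) & 0x7u;
      f->crn = (insn >> 12) & 0xFu;
      f->crm = (insn >> 8) & 0xFu;
      f->op2 = (insn >> 5) & 0x7u;
      f->rt  = insn & 0x1Fu;   /* destination register number */
      return true;
  }

  /* MIDR_EL1 is the system register with op0=3, op1=0, CRn=0, CRm=0,
     op2=0 (S3_0_C0_C0_0). */
  static bool is_midr_el1_read(uint32_t insn)
  {
      struct mrs_fields f;
      return decode_mrs(insn, &f) &&
             f.op0 == 3 && f.op1 == 0 &&
             f.crn == 0 && f.crm == 0 && f.op2 == 0;
  }
  ```

  Passing 0xD5380000 to is_midr_el1_read() returns true with Rt = 0,
  i.e. "mrs x0, midr_el1", which matches the bit pattern printed by
  disInstr above.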

  https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt
  ....
  d) CPU Identification :
      MIDR_EL1 is exposed to help identify the processor. On a
      heterogeneous system, this could be racy (just like getcpu()). The
      process could be migrated to another CPU by the time it uses the
      register value, unless the CPU affinity is set. Hence, there is no
      guarantee that the value reflects the processor that it is
      currently executing on. The REVIDR is not exposed due to this
      constraint, as REVIDR makes sense only in conjunction with the
      MIDR. Alternately, MIDR_EL1 and REVIDR_EL1 are exposed via sysfs
      at:

   /sys/devices/system/cpu/cpu$ID/regs/identification/
                                                 \- midr
                                                 \- revidr
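
  For reference, a minimal sketch of reading the same value through the
  sysfs file mentioned above instead of executing "mrs", and splitting
  it into the architectural MIDR_EL1 fields. The helper names
  (midr_fields, decode_midr, read_midr_sysfs) are hypothetical, and the
  sketch assumes the file holds a single hexadecimal number.

  ```c
  #include <stdint.h>
  #include <stdio.h>

  /* Architectural layout of MIDR_EL1 (low 32 bits). */
  struct midr_fields {
      unsigned implementer;   /* bits [31:24] */
      unsigned variant;       /* bits [23:20] */
      unsigned architecture;  /* bits [19:16] */
      unsigned partnum;       /* bits [15:4]  */
      unsigned revision;      /* bits [3:0]   */
  };

  static struct midr_fields decode_midr(uint64_t midr)
  {
      struct midr_fields f;
      f.implementer  = (unsigned)((midr >> 24) & 0xFFu);
      f.variant      = (unsigned)((midr >> 20) & 0xFu);
      f.architecture = (unsigned)((midr >> 16) & 0xFu);
      f.partnum      = (unsigned)((midr >> 4) & 0xFFFu);
      f.revision     = (unsigned)(midr & 0xFu);
      return f;
  }

  /* Read a hex value from a sysfs file such as
     /sys/devices/system/cpu/cpu0/regs/identification/midr.
     Returns 0 on success, -1 if the file is missing or unparsable. */
  static int read_midr_sysfs(const char *path, uint64_t *out)
  {
      unsigned long long value = 0;
      FILE *fp = fopen(path, "r");
      if (fp == NULL)
          return -1;
      int ok = (fscanf(fp, "%llx", &value) == 1);
      fclose(fp);
      if (!ok)
          return -1;
      *out = (uint64_t)value;
      return 0;
  }
  ```

  A caller would pass the path for the CPU of interest to
  read_midr_sysfs() and hand the result to decode_midr(); on kernels
  that do not expose the file, the read fails and the caller has to
  cope without the value.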

  [Test Case]

  1) Write a 'Hello World' program:
  ----
  #include <stdio.h>

  int main(void) {
      printf("Hello World!\n");
      return 0;
  }
  ----

  2) Build it:
  $ cc -o hello hello.c

  3) Then run valgrind on it:
  $ valgrind ./hello

  [Regression Potential]

  The regression risk is low.

  The symptom occurs when Valgrind disassembles code inside glibc
  (sysdeps/unix/sysv/linux/aarch64/cpu-features.c).

  Even if HWCAP_CPUID is not supported, glibc falls back to assigning 0
  to the midr variable, so I don't think it's an important feature to
  support.

  As stated in the fix itself as a comment:

  ++ /* Limit the AT_HWCAP to just those features we explicitly
  ++   support in VEX.  */
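
  To make the idea concrete, here is a minimal sketch of that masking.
  It is not the code from the upstream commit: the SK_ names and the
  "supported" set below are illustrative assumptions; only the bit
  positions follow the arm64 hwcap definitions in the kernel headers.

  ```c
  #include <stdbool.h>

  /* arm64 hwcap bit positions as in the kernel's uapi <asm/hwcap.h>. */
  #define SK_HWCAP_FP     (1UL << 0)
  #define SK_HWCAP_ASIMD  (1UL << 1)
  #define SK_HWCAP_CPUID  (1UL << 11)

  /* Illustrative only: NOT the actual list used by the upstream fix. */
  #define SK_SUPPORTED_HWCAP (SK_HWCAP_FP | SK_HWCAP_ASIMD)

  /* What the emulator hands to the guest in AT_HWCAP: the host value
     with every bit it does not explicitly support cleared. */
  static unsigned long limit_hwcap(unsigned long host_hwcap)
  {
      return host_hwcap & SK_SUPPORTED_HWCAP;
  }

  /* Guest-side check in the style described above: the MIDR_EL1 read
     is attempted only when the CPUID bit is advertised; otherwise the
     caller keeps midr at 0. */
  static bool guest_would_read_midr(unsigned long hwcap)
  {
      return (hwcap & SK_HWCAP_CPUID) != 0;
  }
  ```

  With a host value that advertises FP, ASIMD and CPUID,
  guest_would_read_midr(limit_hwcap(host)) is false, so the guest never
  reaches the "mrs" instruction that the decoder cannot handle.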

  Additionally, the fix is already present in Ubuntu (disco and later).

  If a regression does occur, it will be limited to the ARM
  architecture and should not affect other CPU architectures.

  [Other information]

  Upstream fix:
  https://sourceware.org/git/?p=valgrind.git;a=commit;h=fbbb696c5d1e93d4ac6cb548c68bb3f443ceef42

  * For some reason, Xenial is not affected:
  ----
  # lsb_release -cs
  xenial

  # lscpu
  Architecture:          aarch64

  # valgrind ./hello 
  ==32367== Memcheck, a memory error detector
  ==32367== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
  ==32367== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
  ==32367== Command: ./hello
  ==32367== 
  Hello World!
  ==32367== 
  ==32367== HEAP SUMMARY:
  ==32367==     in use at exit: 0 bytes in 0 blocks
  ==32367==   total heap usage: 1 allocs, 1 frees, 1,024 bytes allocated
  ==32367== 
  ==32367== All heap blocks were freed -- no leaks are possible
  ==32367== 
  ==32367== For counts of detected and suppressed errors, rerun with: -v
  ==32367== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
  ----

  * Only affecting Bionic:

  # git describe --contains fbbb696c5d1e93d4ac6cb548c68bb3f443ceef42
  VALGRIND_3_14_0~96

  # rmadison valgrind
  => valgrind | 1:3.13.0-2ubuntu2.1      | bionic-updates
     valgrind | 1:3.14.0-2ubuntu6        | disco
     valgrind | 1:3.15.0-1ubuntu3.1      | eoan-updates
     valgrind | 1:3.15.0-1ubuntu5        | focal

  [Original Description]

  I'm performing Valgrind testing on an ElPotato running Ubuntu Bionic
  Aarch64 image. My program is dying like in
  https://bugs.kde.org/show_bug.cgi?id=381556 :

  ```
  $ valgrind --track-origins=yes --suppressions=cryptopp.supp ./cryptest.exe v
  ==12969== Memcheck, a memory error detector
  ==12969== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==12969== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==12969== Command: ./cryptest.exe v
  ==12969==
  ARM64 front end: branch_etc
  disInstr(arm64): unhandled instruction 0xD5380000
  disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
  ==12969== valgrind: Unrecognised instruction at address 0x4014c90.
  ==12969==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==12969==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==12969==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==12969==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==12969==    by 0x4001B47: _dl_start (rtld.c:523)
  ==12969==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)
  ...
  ```

  Here's a similar Red Hat issue report:
  https://bugzilla.redhat.com/show_bug.cgi?id=1467952 .

  Please pick up the patch in the 381556 bug report.

  -----

  $ lsb_release -rd
  Description:    Ubuntu 18.04.2 LTS
  Release:        18.04

  $ apt-cache policy valgrind
  valgrind:
    Installed: 1:3.13.0-2ubuntu2.1
    Candidate: 1:3.13.0-2ubuntu2.1
    Version table:
   *** 1:3.13.0-2ubuntu2.1 500
          500 http://ports.ubuntu.com bionic-updates/main arm64 Packages
          100 /var/lib/dpkg/status
       1:3.13.0-2ubuntu2 500
          500 http://ports.ubuntu.com bionic/main arm64 Packages

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/1826811/+subscriptions