[Bug 1826811] Re: Valgrind unhandled instruction 0xD5380000 on Aarch64

Thu Dec 12 12:49:17 UTC 2019

** Description changed:

- ## DRAFT ###
  [Impact]
  valgrind on bionic dumps core and errors out as follows:

  ARM64 front end: branch_etc
  disInstr(arm64): unhandled instruction 0xD5380000
  disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
  ==11950== valgrind: Unrecognised instruction at address 0x4014c90.
  ==11950==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==11950==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==11950==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==11950==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==11950==    by 0x4001B47: _dl_start (rtld.c:523)
  ==11950==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)
  ==11950== Your program just tried to execute an instruction that Valgrind
  ==11950== did not recognise.  There are two possible reasons for this.
  ==11950== 1. Your program has a bug and erroneously jumped to a non-code
  ==11950==    location.  If you are running Memcheck and you just saw a
  ==11950==    warning about a bad jump, it's probably your program's fault.
  ==11950== 2. The instruction is legitimate but Valgrind doesn't handle it,
  ==11950==    i.e. it's Valgrind's fault.  If you think this is the case or
  ==11950==    you are not sure, please let us know and we'll try to fix it.
  ==11950== Either way, Valgrind will now raise a SIGILL signal which will
  ==11950== probably kill your program.
- ==11950== 
+ ==11950==
  ==11950== Process terminating with default action of signal 4 (SIGILL)
  ==11950==  Illegal opcode at address 0x4014C90
  ==11950==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==11950==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==11950==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==11950==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==11950==    by 0x4001B47: _dl_start (rtld.c:523)
  ==11950==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)

+ The crash occurs because Valgrind simulates the CPU while it runs a
+ process: it disassembles each instruction the process executes and
+ inserts its own instrumentation at run time. In this case, however,
+ Valgrind does not recognize the MIDR_EL1 system register named in the
+ "mrs %0, midr_el1" instruction, which reads the CPU's Main ID Register
+ into the variable bound to %0 (id):
+ 
+ asm volatile ("mrs %0, midr_el1" : "=r"(id));
+ 
+ Because Valgrind cannot decode that instruction, it raises SIGILL and
+ the process is killed.
+ 
+ 
+ https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt
+ ....
+ d) CPU Identification :
+     MIDR_EL1 is exposed to help identify the processor. On a
+     heterogeneous system, this could be racy (just like getcpu()). The
+     process could be migrated to another CPU by the time it uses the
+     register value, unless the CPU affinity is set. Hence, there is no
+     guarantee that the value reflects the processor that it is
+     currently executing on. The REVIDR is not exposed due to this
+     constraint, as REVIDR makes sense only in conjunction with the
+     MIDR. Alternately, MIDR_EL1 and REVIDR_EL1 are exposed via sysfs
+     at:
+ 
+ 	/sys/devices/system/cpu/cpu$ID/regs/identification/
+ 	                                              \- midr
+ 	                                              \- revidr

  [Test Case]

  1) Write a 'Hello World' program:
  ----
  #include <stdio.h>

  int main(void) {
      printf("Hello World!\n");
      return 0;
  }
  ----

  2) Build it:
  $ cc -o hello hello.c

  3) Then run valgrind on it:
  $ valgrind ./hello

  [Regression Potential]

+ The regression potential is low.
+ 
+ The symptom occurs when Valgrind disassembles code inside glibc
+ (sysdeps/unix/sysv/linux/aarch64/cpu-features.c).
+ 
+ Even if HWCAP_CPUID is not supported, glibc defaults to assigning 0 to
+ the midr variable. So, I think it's not an important feature to
+ support.
+ 
+ Additionally, the fix is already in Ubuntu (disco and later).
+ 
+ If a regression does happen for some reason, it will be limited to the
+ ARM architecture and shouldn't affect other CPU architectures.
+ 
  [Other information]

- Upstream fix: 
+ Upstream fix:
  https://sourceware.org/git/?p=valgrind.git;a=commit;h=fbbb696c5d1e93d4ac6cb548c68bb3f443ceef42

  * Only affecting Bionic:

  # git describe --contains fbbb696c5d1e93d4ac6cb548c68bb3f443ceef42
  VALGRIND_3_14_0~96

  # rmadison valgrind
- => valgrind | 1:3.13.0-2ubuntu2.1      | bionic-updates  
-    valgrind | 1:3.14.0-2ubuntu6        | disco                      
-    valgrind | 1:3.15.0-1ubuntu3.1      | eoan-updates    
-    valgrind | 1:3.15.0-1ubuntu5        | focal          
- 
+ => valgrind | 1:3.13.0-2ubuntu2.1      | bionic-updates
+    valgrind | 1:3.14.0-2ubuntu6        | disco
+    valgrind | 1:3.15.0-1ubuntu3.1      | eoan-updates
+    valgrind | 1:3.15.0-1ubuntu5        | focal

  [Original Description]

  I'm performing Valgrind testing on an ElPotato running Ubuntu Bionic
  Aarch64 image. My program is dying like in
  https://bugs.kde.org/show_bug.cgi?id=381556 :

  ```
  $ valgrind --track-origins=yes --suppressions=cryptopp.supp ./cryptest.exe v
  ==12969== Memcheck, a memory error detector
  ==12969== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==12969== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==12969== Command: ./cryptest.exe v
  ==12969==
  ARM64 front end: branch_etc
  disInstr(arm64): unhandled instruction 0xD5380000
  disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
  ==12969== valgrind: Unrecognised instruction at address 0x4014c90.
  ==12969==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==12969==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==12969==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==12969==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==12969==    by 0x4001B47: _dl_start (rtld.c:523)
  ==12969==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)
  ...
  ```

  Here's a similar Red Hat issue report:
  https://bugzilla.redhat.com/show_bug.cgi?id=1467952 .

  Please pick up the patch in the 381556 bug report.

  -----

  $ lsb_release -rd
  Description:    Ubuntu 18.04.2 LTS
  Release:        18.04

  $ apt-cache policy valgrind
  valgrind:
    Installed: 1:3.13.0-2ubuntu2.1
    Candidate: 1:3.13.0-2ubuntu2.1
    Version table:
   *** 1:3.13.0-2ubuntu2.1 500
          500 http://ports.ubuntu.com bionic-updates/main arm64 Packages
          100 /var/lib/dpkg/status
       1:3.13.0-2ubuntu2 500
          500 http://ports.ubuntu.com bionic/main arm64 Packages

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to valgrind in Ubuntu.
https://bugs.launchpad.net/bugs/1826811

Title:
  Valgrind unhandled instruction 0xD5380000 on Aarch64

Status in valgrind package in Ubuntu:
  Fix Released
Status in valgrind source package in Bionic:
  In Progress
Status in valgrind package in Fedora:
  Fix Released

Bug description:
  [Impact]
  valgrind on bionic dumps core and errors out as follows:

  ARM64 front end: branch_etc
  disInstr(arm64): unhandled instruction 0xD5380000
  disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
  ==11950== valgrind: Unrecognised instruction at address 0x4014c90.
  ==11950==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==11950==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==11950==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==11950==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==11950==    by 0x4001B47: _dl_start (rtld.c:523)
  ==11950==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)
  ==11950== Your program just tried to execute an instruction that Valgrind
  ==11950== did not recognise.  There are two possible reasons for this.
  ==11950== 1. Your program has a bug and erroneously jumped to a non-code
  ==11950==    location.  If you are running Memcheck and you just saw a
  ==11950==    warning about a bad jump, it's probably your program's fault.
  ==11950== 2. The instruction is legitimate but Valgrind doesn't handle it,
  ==11950==    i.e. it's Valgrind's fault.  If you think this is the case or
  ==11950==    you are not sure, please let us know and we'll try to fix it.
  ==11950== Either way, Valgrind will now raise a SIGILL signal which will
  ==11950== probably kill your program.
  ==11950==
  ==11950== Process terminating with default action of signal 4 (SIGILL)
  ==11950==  Illegal opcode at address 0x4014C90
  ==11950==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==11950==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==11950==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==11950==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==11950==    by 0x4001B47: _dl_start (rtld.c:523)
  ==11950==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)

  The crash occurs because Valgrind simulates the CPU while it runs a
  process: it disassembles each instruction the process executes and
  inserts its own instrumentation at run time. In this case, however,
  Valgrind does not recognize the MIDR_EL1 system register named in the
  "mrs %0, midr_el1" instruction, which reads the CPU's Main ID Register
  into the variable bound to %0 (id):

  asm volatile ("mrs %0, midr_el1" : "=r"(id));

  Because Valgrind cannot decode that instruction, it raises SIGILL and
  the process is killed.
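
  To see why the word 0xD5380000 from the log corresponds to a read of
  MIDR_EL1, the following is a minimal sketch that splits an AArch64
  MRS instruction into its system-register fields. The struct and
  function names (mrs_fields, decode_mrs, is_midr_el1_read) are
  hypothetical and exist only for this illustration; they are not taken
  from Valgrind or glibc.

  ```c
  #include <stdint.h>

  /* Fields of an AArch64 MRS (read system register) instruction. */
  struct mrs_fields {
      unsigned op0, op1, crn, crm, op2, rt;
  };

  /* Return 1 and fill *out if insn is an MRS instruction, else return 0.
     MRS has the fixed bit pattern 1101 0101 0011 in bits [31:20]. */
  int decode_mrs(uint32_t insn, struct mrs_fields *out)
  {
      if ((insn & 0xFFF00000u) != 0xD5300000u)
          return 0;
      out->op0 = 2 + ((insn >> 19) & 0x1); /* bit 19 selects op0 = 2 or 3 */
      out->op1 = (insn >> 16) & 0x7;
      out->crn = (insn >> 12) & 0xF;
      out->crm = (insn >> 8) & 0xF;
      out->op2 = (insn >> 5) & 0x7;
      out->rt  = insn & 0x1F;              /* destination register Xt */
      return 1;
  }

  /* MIDR_EL1 is encoded as op0=3, op1=0, CRn=0, CRm=0, op2=0. */
  int is_midr_el1_read(uint32_t insn)
  {
      struct mrs_fields f;
      return decode_mrs(insn, &f) && f.op0 == 3 && f.op1 == 0 &&
             f.crn == 0 && f.crm == 0 && f.op2 == 0;
  }
  ```

  Calling is_midr_el1_read(0xD5380000u) returns 1, with Rt = 0, which
  matches "mrs x0, midr_el1".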

  https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt
  ....
  d) CPU Identification :
      MIDR_EL1 is exposed to help identify the processor. On a
      heterogeneous system, this could be racy (just like getcpu()). The
      process could be migrated to another CPU by the time it uses the
      register value, unless the CPU affinity is set. Hence, there is no
      guarantee that the value reflects the processor that it is
      currently executing on. The REVIDR is not exposed due to this
      constraint, as REVIDR makes sense only in conjunction with the
      MIDR. Alternately, MIDR_EL1 and REVIDR_EL1 are exposed via sysfs
      at:

  	/sys/devices/system/cpu/cpu$ID/regs/identification/
  	                                              \- midr
  	                                              \- revidr
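
  As an illustration of the sysfs route quoted above, here is a hedged
  sketch that reads a MIDR value from a file and splits it into its
  architectural fields. The helper names (midr_fields, split_midr,
  read_midr_sysfs) are hypothetical. The path is passed in by the
  caller because the exact file name under identification/ may differ
  between kernel versions; the sketch also assumes the file holds a
  single hexadecimal value.

  ```c
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Architectural layout of the MIDR_EL1 value. */
  struct midr_fields {
      unsigned implementer;   /* bits [31:24] */
      unsigned variant;       /* bits [23:20] */
      unsigned architecture;  /* bits [19:16] */
      unsigned partnum;       /* bits [15:4]  */
      unsigned revision;      /* bits [3:0]   */
  };

  /* Split a raw MIDR value into its fields. */
  struct midr_fields split_midr(uint64_t midr)
  {
      struct midr_fields f;
      f.implementer  = (unsigned)((midr >> 24) & 0xFF);
      f.variant      = (unsigned)((midr >> 20) & 0xF);
      f.architecture = (unsigned)((midr >> 16) & 0xF);
      f.partnum      = (unsigned)((midr >> 4) & 0xFFF);
      f.revision     = (unsigned)(midr & 0xF);
      return f;
  }

  /* Read one hexadecimal value from the file at path (assumed to be a
     sysfs register file). Returns 0 on success, -1 on failure. */
  int read_midr_sysfs(const char *path, uint64_t *midr)
  {
      char buf[64];
      char *end;
      int ok = 0;
      FILE *fp = fopen(path, "r");

      if (fp == NULL)
          return -1;
      if (fgets(buf, sizeof buf, fp) != NULL) {
          unsigned long long v = strtoull(buf, &end, 16);
          if (end != buf) {   /* at least one digit was parsed */
              *midr = v;
              ok = 1;
          }
      }
      fclose(fp);
      return ok ? 0 : -1;
  }
  ```

  Unlike the mrs instruction, this path uses only ordinary file I/O,
  so it does not depend on the instruction being decoded.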

  [Test Case]

  1) Write a 'Hello World' program:
  ----
  #include <stdio.h>

  int main(void) {
      printf("Hello World!\n");
      return 0;
  }
  ----

  2) Build it:
  $ cc -o hello hello.c

  3) Then run valgrind on it:
  $ valgrind ./hello

  [Regression Potential]

  The regression potential is low.

  The symptom occurs when Valgrind disassembles code inside glibc
  (sysdeps/unix/sysv/linux/aarch64/cpu-features.c).

  Even if HWCAP_CPUID is not supported, glibc defaults to assigning 0
  to the midr variable. So, I think it's not an important feature to
  support.

  Additionally, the fix is already in Ubuntu (disco and later).

  If a regression does happen for some reason, it will be limited to the
  ARM architecture and shouldn't affect other CPU architectures.
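
  The fallback described above can be sketched roughly as follows. This
  is a simplified illustration, not the actual glibc source: the
  function name midr_or_zero is hypothetical, and the HWCAP_CPUID value
  (bit 11) is the arm64 Linux definition, repeated here so the sketch
  also compiles on other architectures, where it always returns 0.

  ```c
  #include <stdint.h>

  #ifndef HWCAP_CPUID
  #define HWCAP_CPUID (1UL << 11)  /* arm64 hwcap bit for ID register access */
  #endif

  /* Return MIDR_EL1 when the hwcap word advertises HWCAP_CPUID,
     otherwise fall back to 0 without executing the mrs instruction. */
  uint64_t midr_or_zero(unsigned long hwcap)
  {
      uint64_t midr = 0;
  #if defined(__aarch64__)
      if (hwcap & HWCAP_CPUID)
          __asm__ volatile ("mrs %0, midr_el1" : "=r"(midr));
  #else
      (void)hwcap;  /* no MIDR_EL1 outside AArch64 in this sketch */
  #endif
      return midr;
  }
  ```

  On Linux, a caller would typically pass getauxval(AT_HWCAP) from
  <sys/auxv.h>. When the HWCAP_CPUID bit is clear, the mrs instruction
  is never reached and midr stays 0.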

  [Other information]

  Upstream fix:
  https://sourceware.org/git/?p=valgrind.git;a=commit;h=fbbb696c5d1e93d4ac6cb548c68bb3f443ceef42

  * Only affecting Bionic:

  # git describe --contains fbbb696c5d1e93d4ac6cb548c68bb3f443ceef42
  VALGRIND_3_14_0~96

  # rmadison valgrind
  => valgrind | 1:3.13.0-2ubuntu2.1      | bionic-updates
     valgrind | 1:3.14.0-2ubuntu6        | disco
     valgrind | 1:3.15.0-1ubuntu3.1      | eoan-updates
     valgrind | 1:3.15.0-1ubuntu5        | focal

  [Original Description]

  I'm performing Valgrind testing on an ElPotato running Ubuntu Bionic
  Aarch64 image. My program is dying like in
  https://bugs.kde.org/show_bug.cgi?id=381556 :

  ```
  $ valgrind --track-origins=yes --suppressions=cryptopp.supp ./cryptest.exe v
  ==12969== Memcheck, a memory error detector
  ==12969== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==12969== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
  ==12969== Command: ./cryptest.exe v
  ==12969==
  ARM64 front end: branch_etc
  disInstr(arm64): unhandled instruction 0xD5380000
  disInstr(arm64): 1101'0101 0011'1000 0000'0000 0000'0000
  ==12969== valgrind: Unrecognised instruction at address 0x4014c90.
  ==12969==    at 0x4014C90: init_cpu_features (cpu-features.c:72)
  ==12969==    by 0x4014C90: dl_platform_init (dl-machine.h:208)
  ==12969==    by 0x4014C90: _dl_sysdep_start (dl-sysdep.c:231)
  ==12969==    by 0x40018C3: _dl_start_final (rtld.c:414)
  ==12969==    by 0x4001B47: _dl_start (rtld.c:523)
  ==12969==    by 0x40011C7: ??? (in /lib/aarch64-linux-gnu/ld-2.27.so)
  ...
  ```

  Here's a similar Red Hat issue report:
  https://bugzilla.redhat.com/show_bug.cgi?id=1467952 .

  Please pick up the patch in the 381556 bug report.

  -----

  $ lsb_release -rd
  Description:    Ubuntu 18.04.2 LTS
  Release:        18.04

  $ apt-cache policy valgrind
  valgrind:
    Installed: 1:3.13.0-2ubuntu2.1
    Candidate: 1:3.13.0-2ubuntu2.1
    Version table:
   *** 1:3.13.0-2ubuntu2.1 500
          500 http://ports.ubuntu.com bionic-updates/main arm64 Packages
          100 /var/lib/dpkg/status
       1:3.13.0-2ubuntu2 500
          500 http://ports.ubuntu.com bionic/main arm64 Packages

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/1826811/+subscriptions