[Bug 1288935] [NEW] -march=native uses AVX instructions on platform without AVX, results in illegal instruction @ runtime
Robert C Jennings
1288935 at bugs.launchpad.net
Thu Mar 6 19:02:20 UTC 2014
Public bug reported:
Amazon EC2 hs1.8xlarge instances do not support AVX, but gcc apparently
believes they do. I'm not sure whether the issue is in gcc or the
hypervisor, but we're looking for someone to explain how gcc with
"-march=native" decides to use AVX instructions.
While running performance tests I found a test, stream
(http://www.cs.virginia.edu/stream/), which would crash with an illegal
instruction error. This was originally seen with the phoronix test
suite, but the upstream source for stream provides a simpler/smaller
testcase.
The environment is a paravirt instance-store instance of type hs1.8xlarge
(tested in us-east-1 with a daily precise build; details below).
The illegal instruction is vcvtsi2sdl, the AVX (VEX-encoded) form of
CVTSI2SD (Convert signed doubleword or quadword integer to scalar
double-precision floating-point value). The program is compiled with the
"-march=native" flag to gcc; without this flag we don't see the issue.
This compiler flag causes the compiler to auto-detect the CPU of the
build computer.
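One way to see what "-march=native" resolved to on the build host is to
ask gcc to print its target options with "-Q --help=target" and look at
the line for "-mavx". The helper below is a minimal sketch: option_state
is a hypothetical name, and it assumes the two-column "option [state]"
layout that gcc prints, which may differ between gcc versions.

```shell
#!/bin/sh
# option_state OPTION
# Reads `gcc -Q --help=target` output on stdin and prints the state column
# (for example "[enabled]" or "[disabled]") of the line whose first field
# is exactly OPTION. Prints nothing if the option is not listed.
# Assumption: each option line looks like "  -mavx    [enabled]".
option_state() {
    awk -v "opt=$1" '$1 == opt { print $2 }'
}

# Usage on the build host (requires gcc on PATH):
#   gcc -march=native -Q --help=target | option_state -mavx
```

If this prints "[enabled]" for -mavx on the affected instance, that would
be consistent with the vcvtsi2sdl seen in the disassembly; that output was
not captured in this report.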
vcvtsi2sdl is an AVX instruction and is not supported on HS1 at this
time. The public documentation on supported processor features,
http://aws.amazon.com/ec2/instance-types/#Instance_Type_Processor_Details,
has a bit of info, but I've also included /proc/cpuinfo.
# Testcase
0 - Launch hs1.8xlarge instance type (us-east-1 with ubuntu-precise-daily-amd64-server-20140213.manifest.xml (ami-ef4b4f86) used for test)
1 - Get stream benchmark from http://www.cs.virginia.edu/stream/
$ wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
2 - demonstrate successful run without -march=native
$ gcc -g -O3 stream.c -o stream
$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 10561 microseconds.
(= 10561 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 9105.4 0.017592 0.017572 0.017621
Scale: 9318.1 0.017199 0.017171 0.017254
Add: 9542.7 0.025204 0.025150 0.025257
Triad: 9516.3 0.025263 0.025220 0.025307
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
3 - demonstrate failure
$ gcc -g -march=native stream.c -o stream
$ ulimit -c unlimited
$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Illegal instruction (core dumped)
$ gdb ./stream core
...
Core was generated by `./stream'.
Program terminated with signal 4, Illegal instruction.
#0 main () at stream.c:236
236 printf("Memory per array = %.1f MiB (= %.1f GiB).\n",
(gdb) disassemble /mr 0x0000000000400601,+10
Dump of assembler code from 0x400601 to 0x40060b:
236 printf("Memory per array = %.1f MiB (= %.1f GiB).\n",
=> 0x0000000000400601 <+109>: c5 fb 2a 45 fc vcvtsi2sdl -0x4(%rbp),%xmm0,%xmm0
0x0000000000400606 <+114>: c5 fb 10 0d 3a 16 00 00 vmovsd 0x163a(%rip),%xmm1 # 0x401c48
4 - Check capabilities presented by system (SSE2 is present; avx is also listed in flags)
$ cat /proc/cpuinfo |head -25
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
microcode : 0x70b
cpu MHz : 1999.999
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 2
cpu cores : 1
apicid : 37
initial apicid : 37
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes avx hypervisor lahf_lm arat epb xsaveopt pln pts dtherm
bogomips : 3999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
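The flags line above can also be checked mechanically. The sketch below
tests for a single flag as a whole word, so that xsaveopt is not mistaken
for xsave; has_cpu_flag is a hypothetical helper name. In the listing
above avx is present while xsave is not. Since the operating system has
to save AVX register state through XSAVE, that combination is one
possible reason the instruction faults even though the CPU advertises
AVX. This report does not confirm that explanation.

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]
# Succeeds (exit status 0) if FLAG appears as a whole word on the first
# "flags" line of FILE, which defaults to /proc/cpuinfo.
has_cpu_flag() {
    awk -v "flag=$1" '
        /^flags[ \t]*:/ {
            # Compare every field; "flags" and ":" never equal a real flag.
            for (i = 1; i <= NF; i++) if ($i == flag) found = 1
            exit
        }
        END { exit !found }
    ' "${2:-/proc/cpuinfo}"
}

# Usage on the instance under test:
#   has_cpu_flag avx && ! has_cpu_flag xsave && echo "avx advertised without xsave"
```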
** Affects: gcc-4.6 (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to gcc-4.6 in Ubuntu.
https://bugs.launchpad.net/bugs/1288935