[Bug 1663280] Re: Serious performance degradation of math functions

Bug Watch Updater 1663280 at bugs.launchpad.net
Fri Oct 27 08:12:00 UTC 2017


Launchpad has imported 10 comments from the remote bug at
https://bugzilla.redhat.com/show_bug.cgi?id=1421121.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2017-02-10T12:09:55+00:00 Oleg wrote:

This bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25
[2]. Fedora 24 and Fedora 25 are affected because they use either Glibc
2.23 or 2.24. The bug introduces a serious (2x-4x) performance
degradation of the math functions (pow, exp/exp2/exp10, log/log2/log10,
sin/cos/sincos/tan, asin/acos/atan/atan2, sinh/cosh/tanh,
asinh/acosh/atanh) provided by libm. The bug can be reproduced on any
AVX-capable x86-64 machine.

This bug is all about the AVX-SSE transition penalty [3]. The 256-bit
YMM registers used by AVX-256 instructions extend the 128-bit registers
used by SSE (XMM0 is the low half of YMM0, and so on). Every time the
CPU executes an SSE instruction after an AVX-256 instruction, it has to
store the upper halves of the YMM registers to an internal buffer and
then restore them when execution returns to AVX instructions. The
store/restore is required because old-fashioned SSE knows nothing about
the upper halves of its registers and may damage them. This
store/restore operation is time consuming (several tens of clock cycles
each time). To deal with this issue, Intel introduced AVX-128
instructions, which operate on the same 128-bit XMM registers as SSE
but are aware of the upper halves of the YMM registers; hence, no
store/restore is required. Practically speaking, AVX-128 instructions
are a smarter form of SSE instructions which can be used together with
full-size AVX-256 instructions without any penalty, and Intel
recommends using them instead of SSE instructions wherever possible. To
sum things up: it's okay to mix SSE with AVX-128, and AVX-128 with
AVX-256. Mixing AVX-128 with AVX-256 is allowed because both types of
instructions are aware of the 256-bit YMM registers. Mixing SSE with
AVX-128 is okay because the CPU can guarantee that the upper halves of
the YMM registers contain no meaningful data (there is no way to put
data there without using AVX-256 instructions) and can skip the
store/restore operation (there is no reason to preserve random garbage
in the upper halves of the YMM registers). It's not okay to mix SSE
with AVX-256 because of the transition penalty. The scalar
floating-point instructions used by the routines mentioned above are
implemented as a subset of the SSE and AVX-128 instruction sets. They
operate on only a small fraction of a 128-bit register but are still
considered SSE or AVX-128 instructions, and they suffer from the
SSE/AVX transition penalty as well.
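
For illustration, here is a minimal sketch of the instruction classes
discussed above (hypothetical file classes.c; the GCC target attribute
is used so the AVX-256 function can live in the same file regardless of
the global -mavx setting):

#include <immintrin.h>

/* Compiled without -mavx this becomes the SSE encoding (addps);
   compiled with -mavx it becomes the AVX-128 encoding (vaddps xmm),
   which is aware of the upper halves of the YMM registers. */
__m128 add128(__m128 x, __m128 y) {
  return _mm_add_ps(x, y);
}

/* Always the AVX-256 encoding (vaddps ymm), operating on the full
   256-bit YMM registers. */
__attribute__((target("avx")))
__m256 add256(__m256 x, __m256 y) {
  return _mm256_add_ps(x, y);
}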

Glibc inadvertently triggers a chain of AVX/SSE transition penalties
due to inappropriate use of AVX-256 instructions inside the
_dl_runtime_resolve() procedure. By using AVX-256 instructions to
push/pop the YMM registers, Glibc makes the CPU assume that the upper
halves of the XMM registers contain meaningful data which needs to be
preserved during execution of SSE instructions. With such a 'dirty'
flag set, every switch between SSE and AVX instructions (AVX-128 or
AVX-256) leads to the time-consuming store/restore procedure. This
'dirty' flag never gets cleared during the whole program execution,
which leads to a serious overall slowdown. The fixed implementation [2]
of _dl_runtime_resolve() avoids using AVX-256 instructions where
possible.
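
For background, the 'dirty' state can be cleared explicitly with the
vzeroupper instruction. The sketch below only illustrates the mechanism
(it is not the fix glibc adopted) and assumes GCC's intrinsics:

#include <immintrin.h>

/* vzeroupper zeroes the upper halves of all YMM registers and clears
   the 'dirty' state, so subsequent SSE instructions run without the
   save/restore penalty. Compilers insert it automatically around the
   AVX/SSE boundaries they can see, but the transitions caused by the
   dynamic linker trampoline are invisible to them. */
__attribute__((target("avx")))
static void clear_ymm_upper_state(void) {
  _mm256_zeroupper();  /* emits vzeroupper */
}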

The buggy _dl_runtime_resolve() gets called every time the dynamic
linker tries to resolve a symbol (any symbol, not just the ones
mentioned above). It's enough for _dl_runtime_resolve() to be called
just once to touch the upper halves of the YMM registers and provoke
the AVX/SSE transition penalty from then on. It's safe to say that
every dynamically linked application calls _dl_runtime_resolve() at
least once, which means that all of them may experience the slowdown.
The degradation shows up whenever such an application mixes AVX and SSE
instructions (switches from AVX to SSE or back).
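
As a point of reference, lazy binding (and with it the trampoline) can
also be disabled per binary at link time, instead of via the
LD_BIND_NOW=1 environment variable used in the experiments below; a
sketch with the exp test from the next section:

$ gcc -O3 -march=x86-64 -Wl,-z,now -o exp exp.c -lm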

There are two types of math routines provided by libm:
(a) ones that have an AVX-optimized version (exp, sin/cos, tan, atan, log, and others)
(b) ones that don't have an AVX-optimized version and rely on the general-purpose SSE implementation (pow, exp2/exp10, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh, and others)

For the former group, the slowdown happens when the routines get called
from SSE code (i.e. from an application compiled with -mno-avx) because
an SSE -> AVX transition takes place. For the latter, the slowdown
happens when the routines get called from AVX code (i.e. from an
application compiled with -mavx) because an AVX -> SSE transition takes
place. Both situations are realistic: gcc generates SSE code when
targeting generic x86-64, and generates AVX-optimized code with
-march=native on AVX-capable machines.
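
A quick way to confirm which flavor gcc emits for the exp.c test below
(a sketch; the grep patterns just look for the scalar double-precision
add in each encoding, since VEX-encoded AVX-128 mnemonics gain a 'v'
prefix):

$ gcc -O3 -march=x86-64 -S -o - exp.c | grep -c 'addsd'         # SSE
$ gcc -O3 -march=x86-64 -mavx -S -o - exp.c | grep -c 'vaddsd'  # AVX-128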

============================================================================

Let's take one routine from the group (a) and try to reproduce the
slowdown.

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -o exp exp.c -lm

$ time ./exp
<..> 2.801s <..>

$ time LD_BIND_NOW=1 ./exp
<..> 0.660s <..>

You can see that the application runs 4x faster when
_dl_runtime_resolve() doesn't get called. That's how serious the impact
of an AVX/SSE transition can be.

============================================================================

Let's take one routine from the group (b) and try to reproduce the
slowdown.

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += pow(M_PI, b);
  printf("%f\n", a);
  return 0;
}

# note that the -mavx option has been passed
$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

$ time ./pow
<..> 4.157s <..>

$ time LD_BIND_NOW=1 ./pow
<..> 2.123s <..>

You can see that the application runs 2x faster when
_dl_runtime_resolve() doesn't get called.

============================================================================

[!] It's important to mention that the scope of this bug might be even
wider. After a call to the buggy _dl_runtime_resolve(), any transition
between AVX-128 and SSE (otherwise legitimate) suffers from the
penalty. Any application which mixes AVX-128 floating-point code with
SSE floating-point code (e.g. by using an external SSE-only library)
will experience a serious slowdown.
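
A sketch of such a mix (hypothetical files; each translation unit is
legitimate on its own, and with a clean upper state the combination
would be penalty-free):

/* sse_part.c -- built without -mavx: plain SSE scalar code */
double scale(double x) { return x * 1.5; }

/* avx_part.c -- built with -mavx: VEX-encoded AVX-128 scalar code */
extern double scale(double x);
double blend(double x) { return scale(x) + x * 0.5; }

$ gcc -O3 -c sse_part.c
$ gcc -O3 -mavx -c avx_part.c

After the buggy trampoline has run once, every call from blend() into
scale() and back crosses an AVX-128/SSE boundary and pays the
store/restore penalty.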

[0] https://sourceware.org/bugzilla/show_bug.cgi?id=20495
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
[3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/3

------------------------------------------------------------------------
On 2017-02-10T14:10:06+00:00 Carlos wrote:

I don't see anywhere near the performance degradation you're seeing, so
it must be heavily dependent on the family and stepping that you're
using.

e.g.
[carlos@athas rhbz1421121]$ time LD_BIND_NOW=1 ./pow-test
154964150.331550

real	0m1.831s
user	0m1.820s
sys	0m0.003s

[carlos@athas rhbz1421121]$ time ./pow-test
154964150.331550

real	0m1.830s
user	0m1.820s
sys	0m0.001s

Verified that pow-test was built without DT_FLAGS BIND_NOW.

I agree that it is less than optimal to have processor state
transitions like those you indicate every time the dynamic loader
trampoline is called.

We'll look into this.

Fedora 26 will not have this problem since it's based on glibc 2.25 with
the fix you indicate already present.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/4

------------------------------------------------------------------------
On 2017-02-10T14:38:55+00:00 Oleg wrote:

Hi Carlos,

Many thanks for looking into this! Could you please confirm that you
used the following command to compile the pow test with gcc:

$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

Passing -mavx is the key for this example to work as expected. You
want to compile the pow() test WITH -mavx but the exp() test WITHOUT
-mavx.

I'd also appreciate it if you could tell me which CPU you are testing
on. It's impossible for me to run this test on every possible CPU
(I've tried Sandy Bridge and Ivy Bridge machines so far), and this
information would be really helpful.

Thanks!

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/5

------------------------------------------------------------------------
On 2017-02-10T14:49:23+00:00 Carlos wrote:

(In reply to Oleg Strikov from comment #4)
> Hi Carlos,
> 
> Many thanks for looking into this! Could you please confirm that you used
> the following command to compile pow test with gcc:
> 
> $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

I can confirm that I used these options on an F25 system.

The dynamic loader trampoline is only called once in the loop to resolve
the singular math function call, and after that it's the same sequence
over and over again without any explicit software save/restore (though
the CPU might do something for the transition).

[carlos@athas rhbz1421121]$ gcc -O3 -march=x86-64 -mavx -o pow-test pow-test.c -lm
[carlos@athas rhbz1421121]$ time ./pow-test
154964150.331550

real	0m1.829s
user	0m1.819s
sys	0m0.002s
[carlos@athas rhbz1421121]$ time LD_BIND_NOW=1 ./pow-test
154964150.331550

real	0m1.833s
user	0m1.819s
sys	0m0.005s

gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC)

> Passing -mavx is the key thing for this example to work as expected. You
> want to compile pow() test WITH -mavx but exp() test WITHOUT -mavx.
> 
> I'd also appreciate if you tell me on which CPU you do testing. It's
> impossible for me run this test on every possible CPU (tried on Sandy Bridge
> and Ivy Bridge machines so far) and this information would be really helpful.

I ran this on an i5-4690K, so a Haswell series CPU, but without AVX512.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/6

------------------------------------------------------------------------
On 2017-02-10T14:54:20+00:00 Florian wrote:

(In reply to Carlos O'Donell from comment #5)
> (In reply to Oleg Strikov from comment #4)
> > Hi Carlos,
> > 
> > Many thanks for looking into this! Could you please confirm that you used
> > the following command to compile pow test with gcc:
> > 
> > $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm
> 
> I can confirm that I used these options on an F25 system.
> 
> The dynamic loader trampoline is only called once in the loop to resolve the
> singular math function call, and after that it's the same sequence over and
> over again without any explicit software save/restore (though the CPU might
> do something for the transition).

Right, that's why I found the claim about the substantial performance
impact always a bit puzzling.

What happens if you use LD_BIND_NOT=1?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/7

------------------------------------------------------------------------
On 2017-02-10T15:09:47+00:00 Carlos wrote:

(In reply to Florian Weimer from comment #6)
> (In reply to Carlos O'Donell from comment #5)
> > (In reply to Oleg Strikov from comment #4)
> > > Hi Carlos,
> > > 
> > > Many thanks for looking into this! Could you please confirm that you used
> > > the following command to compile pow test with gcc:
> > > 
> > > $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm
> > 
> > I can confirm that I used these options on an F25 system.
> > 
> > The dynamic loader trampoline is only called once in the loop to resolve the
> > singular math function call, and after that it's the same sequence over and
> > over again without any explicit software save/restore (though the CPU might
> > do something for the transition).
> 
> Right, that's why I found the claim about the substantial performance impact
> always a bit puzzling.

Agreed.

> What happens if you use LD_BIND_NOT=1?

[carlos@athas rhbz1421121]$ time LD_BIND_NOT=1 ./pow-test
154964150.331550

real	0m4.527s
user	0m4.505s
sys	0m0.003s

Terrible performance, as expected.

Surprisingly in line with Oleg's numbers.

However, the LD_BIND_NOT behaviour is never the default; you'd have to
be running with a preloaded audit library (LD_AUDIT) to trigger that
kind of behaviour.

Perhaps something is wrong with Oleg's system configuration?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/8

------------------------------------------------------------------------
On 2017-02-10T15:35:50+00:00 Oleg wrote:

To my understanding, once the trampoline has touched the upper halves
of the YMM registers, ALL future switches between AVX and SSE require
the time-consuming store/restore operation (i.e. all future calls to
pow will suffer). Touching the upper halves sets something like a
dirty flag (which forces the CPU to do the store/restore), and this
flag never gets cleared during the whole program execution. That's why
the impact is so serious.
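
(For anyone who wants to observe the transitions directly: on Sandy
Bridge / Ivy Bridge, perf should expose hardware assist events for
them. The event names below are a sketch for those generations and
differ elsewhere.)

$ perf stat -e other_assists.avx_to_sse,other_assists.sse_to_avx ./pow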

I was able to reproduce the issue using the F25 live CD, so it looks
like a CPU-model-dependent issue. We were also able to reproduce it on
an E5-1630 (Haswell).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/9

------------------------------------------------------------------------
On 2017-02-10T16:37:16+00:00 Marcel wrote:

Hi, I'm the one that Oleg referred to who had this issue on an E5-1630 CPU. It turns out that I actually /cannot/ reproduce it with a Fedora 25 live CD (neither before nor after an update of glibc)!
I don't normally use Fedora on this machine; I originally encountered the problem with Ubuntu 16.04 (which has glibc 2.23, not 2.24 like Fedora 25) -- there it is perfectly reproducible with Oleg's code, with very similar timings to the ones that Oleg reported.

This is very confusing. I can try with a Fedora 24 live CD as well, but
Oleg seems to be able to reproduce it on Fedora 25, so...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/10

------------------------------------------------------------------------
On 2017-02-10T16:57:45+00:00 Marcel wrote:

Um, sorry for the noise, but it seems that the bug was fixed with
Fedora's glibc 2.24-4 release:

* Fri Dec 23 2016 Carlos O'Donell <carlos at ...> - 2.24-4
  - Auto-sync with upstream release/2.24/master,
    commit e9e69e468039fcd57276f783a16aa771a8e4214e, fixing:
  - [...]
  - Fix runtime resolver routines in the presence of AVX512 (swbz#20508)
  - [...]

That would explain why Oleg saw it with the Fedora 25 live CD (which
still has 2.24-3) while Carlos did not see it on his system. Now what I
don't understand is why I myself could not reproduce it with the live
CD, even though I tried compiling/running the test before updating
glibc...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/11

------------------------------------------------------------------------
On 2017-02-11T13:00:19+00:00 Oleg wrote:

I just reran all the tests on F24 and F25. I can confirm that the
performance issue disappears on F25 once the glibc package is updated
to version 2.24-4. It is still observable on F24 because the fix has
not been propagated there. I'm very sorry for such a stupid mistake
(not updating the live CD packages before running the tests). Thanks to
Marcel for pointing that out; it saved me a huge amount of time.

We also did some investigation into which specific CPU models suffer
from this kind of performance degradation. A quite reliable source [1]
says that 'AMD processors and later Intel processors (Skylake and
Knights Landing) do not have such a state switch'. This means that only
Sandy Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected.

Many thanks to Carlos and Florian for such a fast and
straight-to-the-point response. I really appreciate it.

[1] http://www.agner.org/optimize/blog/read.php?i=761#761

Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280/comments/12


** Changed in: glibc (Fedora)
       Status: Unknown => Fix Released

** Changed in: glibc (Fedora)
   Importance: Unknown => Undecided

** Bug watch added: Sourceware.org Bugzilla #20495
   https://sourceware.org/bugzilla/show_bug.cgi?id=20495

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1663280

Title:
  Serious performance degradation of math functions

Status in GLibC:
  Fix Released
Status in glibc package in Ubuntu:
  Triaged
Status in glibc source package in Zesty:
  Triaged
Status in glibc package in Fedora:
  Fix Released

Bug description:
  This bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc
  2.25 [2]. All Ubuntu versions starting from 16.04 are affected
  because they use either Glibc 2.23 or 2.24. The bug introduces a
  serious (2x-4x) performance degradation of the math functions (pow,
  exp/exp2/exp10, log/log2/log10, sin/cos/sincos/tan,
  asin/acos/atan/atan2, sinh/cosh/tanh, asinh/acosh/atanh) provided by
  libm. The bug can be reproduced on any AVX-capable x86-64 machine.

  @strikov: According to a quite reliable source [5], all AMD CPUs and
  the latest Intel CPUs (Skylake and Knights Landing) don't suffer
  from the AVX/SSE transition penalty. This means the scope of this
  bug becomes smaller and includes only the following generations of
  Intel CPUs: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. The
  scope still remains quite large though.

  @strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the
  fix from the upstream 2.24 branch (as Marcel pointed out, the fix
  has been backported to the 2.24 branch, from which Fedora took it
  successfully) if such a synchronization takes place. Ubuntu 16.04
  (the main target of this bug) uses Glibc 2.23, which hasn't been
  patched upstream, and will suffer from the performance degradation
  until we fix it manually.

  This bug is all about the AVX-SSE transition penalty [3]. The
  256-bit YMM registers used by AVX-256 instructions extend the
  128-bit registers used by SSE (XMM0 is the low half of YMM0, and so
  on). Every time the CPU executes an SSE instruction after an AVX-256
  instruction, it has to store the upper halves of the YMM registers
  to an internal buffer and then restore them when execution returns
  to AVX instructions. The store/restore is required because
  old-fashioned SSE knows nothing about the upper halves of its
  registers and may damage them. This store/restore operation is time
  consuming (several tens of clock cycles each time). To deal with
  this issue, Intel introduced AVX-128 instructions, which operate on
  the same 128-bit XMM registers as SSE but are aware of the upper
  halves of the YMM registers; hence, no store/restore is required.
  Practically speaking, AVX-128 instructions are a smarter form of SSE
  instructions which can be used together with full-size AVX-256
  instructions without any penalty, and Intel recommends using them
  instead of SSE instructions wherever possible. To sum things up:
  it's okay to mix SSE with AVX-128, and AVX-128 with AVX-256. Mixing
  AVX-128 with AVX-256 is allowed because both types of instructions
  are aware of the 256-bit YMM registers. Mixing SSE with AVX-128 is
  okay because the CPU can guarantee that the upper halves of the YMM
  registers contain no meaningful data (there is no way to put data
  there without using AVX-256 instructions) and can skip the
  store/restore operation (there is no reason to preserve random
  garbage in the upper halves of the YMM registers). It's not okay to
  mix SSE with AVX-256 because of the transition penalty. The scalar
  floating-point instructions used by the routines mentioned above are
  implemented as a subset of the SSE and AVX-128 instruction sets.
  They operate on only a small fraction of a 128-bit register but are
  still considered SSE or AVX-128 instructions, and they suffer from
  the SSE/AVX transition penalty as well.

  Glibc inadvertently triggers a chain of AVX/SSE transition penalties
  due to inappropriate use of AVX-256 instructions inside the
  _dl_runtime_resolve() procedure. By using AVX-256 instructions to
  push/pop the YMM registers [4], Glibc makes the CPU assume that the
  upper halves of the XMM registers contain meaningful data which
  needs to be preserved during execution of SSE instructions. With
  such a 'dirty' flag set, every switch between SSE and AVX
  instructions (AVX-128 or AVX-256) leads to the time-consuming
  store/restore procedure. This 'dirty' flag never gets cleared during
  the whole program execution, which leads to a serious overall
  slowdown. The fixed implementation [2] of _dl_runtime_resolve()
  avoids using AVX-256 instructions where possible.

  The buggy _dl_runtime_resolve() gets called every time the dynamic
  linker tries to resolve a symbol (any symbol, not just the ones
  mentioned above). It's enough for _dl_runtime_resolve() to be called
  just once to touch the upper halves of the YMM registers and provoke
  the AVX/SSE transition penalty from then on. It's safe to say that
  every dynamically linked application calls _dl_runtime_resolve() at
  least once, which means that all of them may experience the
  slowdown. The degradation shows up whenever such an application
  mixes AVX and SSE instructions (switches from AVX to SSE or back).

  There are two types of math routines provided by libm:
  (a) ones that have an AVX-optimized version (exp, sin/cos, tan, atan, log, and others)
  (b) ones that don't have an AVX-optimized version and rely on the general-purpose SSE implementation (pow, exp2/exp10, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh, and others)

  For the former group, the slowdown happens when the routines get
  called from SSE code (i.e. from an application compiled with
  -mno-avx) because an SSE -> AVX transition takes place. For the
  latter, the slowdown happens when the routines get called from AVX
  code (i.e. from an application compiled with -mavx) because an AVX
  -> SSE transition takes place. Both situations are realistic: gcc
  generates SSE code when targeting generic x86-64, and generates
  AVX-optimized code with -march=native on AVX-capable machines.

  ============================================================================

  Let's take one routine from the group (a) and try to reproduce the
  slowdown.

  #include <math.h>
  #include <stdio.h>

  int main () {
    double a, b;
    for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
    printf("%f\n", a);
    return 0;
  }

  $ gcc -O3 -march=x86-64 -o exp exp.c -lm

  $ time ./exp
  <..> 2.801s <..>

  $ time LD_BIND_NOW=1 ./exp
  <..> 0.660s <..>

  You can see that the application runs 4x faster when
  _dl_runtime_resolve() doesn't get called. That's how serious the
  impact of an AVX/SSE transition can be.

  ============================================================================

  Let's take one routine from the group (b) and try to reproduce the
  slowdown.

  #include <math.h>
  #include <stdio.h>

  int main () {
    double a, b;
    for (a = b = 0.0; b < 2.0; b += 0.00000005) a += pow(M_PI, b);
    printf("%f\n", a);
    return 0;
  }

  # note that the -mavx option has been passed
  $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

  $ time ./pow
  <..> 4.157s <..>

  $ time LD_BIND_NOW=1 ./pow
  <..> 2.123s <..>

  You can see that the application runs 2x faster when
  _dl_runtime_resolve() doesn't get called.

  ============================================================================

  [!] It's important to mention that the scope of this bug might be
  even wider. After a call to the buggy _dl_runtime_resolve(), any
  transition between AVX-128 and SSE (otherwise legitimate) suffers
  from the penalty. Any application which mixes AVX-128 floating-point
  code with SSE floating-point code (e.g. by using an external
  SSE-only library) will experience a serious slowdown.

  [0] https://sourceware.org/bugzilla/show_bug.cgi?id=20495
  [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
  [2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
  [3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
  [4] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/dl-trampoline.h;h=d6c7f989b5e74442cacd75963efdc6785ac6549d;hb=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604#l182
  [5] http://www.agner.org/optimize/blog/read.php?i=761#761

To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1663280/+subscriptions


