[Bug 1928508] Re: Performance regression on memcpy() calls for AMD Zen

Mon Dec 13 06:03:40 UTC 2021

Hello Heitor, or anyone else affected,

Accepted glibc into focal-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/glibc/2.31-0ubuntu9.4
in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us in getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from
verification-needed-focal to verification-done-focal. If it does not fix
the bug for you, please add a comment stating that, and change the tag
to verification-failed-focal. In either case, without details of your
testing we will not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: glibc (Ubuntu Focal)
       Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-focal

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1928508

Title:
  Performance regression on memcpy() calls for AMD Zen

Status in glibc package in Ubuntu:
  Fix Released
Status in glibc source package in Focal:
  Fix Committed
Status in glibc source package in Groovy:
  Won't Fix

Bug description:
  [Impact]
  On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_shared_non_temporal_threshold is calculated.

  Before 'glibc-2.33~455', the threshold was calculated taking into
  consideration the number of hardware threads in the CPU. On AMD Ryzen
  and EPYC systems, this can be counter-productive if the number of
  threads is high enough for the last-level caches to "overrun" each
  other and cause cache line flushes. The solution is to lower the
  size threshold above which memcpy() uses non-temporal stores, by
  removing the number of threads from the equation.
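
  The effect of dropping the thread count can be sketched with plain
  shell arithmetic. This is an illustration only: the 3/4 factor and the
  function names old_threshold and new_threshold are assumptions for the
  example, not copied from the glibc source, and the cache size and
  thread count are hypothetical values.

```shell
# Sketch only: formulas and names are assumptions, not glibc source.
# $1 = one thread's share of the last-level cache, in bytes
# $2 = number of hardware threads (old formula only)
old_threshold() { echo $(( $1 * $2 * 3 / 4 )); }   # grows with thread count
new_threshold() { echo $(( $1 * 3 / 4 )); }        # thread count removed

# Hypothetical inputs: 1 MiB per-thread share, 16 hardware threads
old_threshold 1048576 16   # prints 12582912
new_threshold 1048576      # prints 786432
```

  With these made-up inputs the old threshold is 16 times larger than
  the new one, so a copy has to be far bigger before it switches to
  non-temporal stores.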

  [Test Plan]
  Compile the test_memcpy.c that is attached to this bug report:

  $ gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64

  Run the resulting binary before and after installing the libc packages
  from proposed. On Ryzen and EPYC systems a substantial improvement
  should be seen; on other systems, no significant change should be seen.
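
  If hyperfine is not at hand, the before/after comparison could be
  timed with a small helper like the one below. This is a minimal
  sketch assuming GNU date (for the %N format) and awk are available;
  mean_runtime is a hypothetical name and is not part of the attached
  test program.

```shell
# Hypothetical helper: run a command N times, print mean wall-clock seconds.
# Usage: mean_runtime RUNS COMMAND [ARGS...]
mean_runtime() {
    runs=$1; shift
    start=$(date +%s.%N)              # GNU date: seconds.nanoseconds
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the program's output; abort if a run fails.
        "$@" >/dev/null 2>&1 || return 1
        i=$((i + 1))
    done
    end=$(date +%s.%N)
    # Shell arithmetic is integer-only, so let awk do the division.
    awk -v s="$start" -v e="$end" -v n="$runs" \
        'BEGIN { printf "%.3f\n", (e - s) / n }'
}
```

  For example, "mean_runtime 10 ./test_memcpy64 32" could be run once
  with the current libc6 and once after upgrading from -proposed, and
  the two means compared.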

  [Where problems could occur]
  Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc.) to make sure we won't hurt performance in non-AMD environments.

  [Other Info]
  This issue has been fixed by the following upstream commit:
  - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold)

  $ git describe --contains d3c57027470b
  glibc-2.33~455
  $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute
   glibc | 2.31-0ubuntu9   | focal           | source
   glibc | 2.31-0ubuntu9.2 | focal-updates   | source
   glibc | 2.32-0ubuntu3   | groovy          | source
   glibc | 2.32-0ubuntu3.2 | groovy-proposed | source
   glibc | 2.33-0ubuntu5   | hirsute         | source

  Affected releases include Ubuntu Focal and Groovy. Bionic is not
  affected, and releases starting with Hirsute already ship the upstream
  patch to fix this regression.

  glibc exports this specific variable as a tunable, so we could also tweak it with the GLIBC_TUNABLES env var:
  $ hyperfine -n clean-env 'lxc exec focal env ./test_memcpy64 32' -n tunables 'lxc exec focal env GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=1024*1024*3*4 ./test_memcpy64 32'
  Benchmark #1: clean-env
    Time (mean ± σ):      2.529 s ±  0.061 s    [User: 6.0 ms, System: 4.7 ms]
    Range (min … max):    2.457 s …  2.615 s    10 runs

  Benchmark #2: tunables
    Time (mean ± σ):      1.427 s ±  0.030 s    [User: 6.5 ms, System: 3.8 ms]
    Range (min … max):    1.402 s …  1.482 s    10 runs

  Summary
    'tunables' ran
      1.77 ± 0.06 times faster than 'clean-env'

  This solution is not ideal, but it offers a secondary way of fixing
  the performance issues. However, the speed gains for memcpy() are
  noticeable enough that we should strongly consider changing the
  defaults in the Focal LTS release, so that it performs similarly to
  Bionic and future Ubuntu releases starting with Hirsute.
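
  When experimenting with the tunable, it may be safer to pass the
  threshold as a plain byte count. The sketch below assumes the tunable
  parser expects an integer rather than an arithmetic expression, and
  computes the value in the shell first; non_temporal_tunable is a
  hypothetical helper name.

```shell
# Hypothetical helper: build a GLIBC_TUNABLES string from a size in MiB.
non_temporal_tunable() {
    bytes=$(( $1 * 1024 * 1024 ))
    printf 'glibc.cpu.x86_non_temporal_threshold=%s\n' "$bytes"
}

# 1024*1024*3*4 bytes is 12 MiB
non_temporal_tunable 12   # prints glibc.cpu.x86_non_temporal_threshold=12582912
```

  The result can then be used as, for instance,
  GLIBC_TUNABLES="$(non_temporal_tunable 12)" ./test_memcpy64 32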

  [old test case section]
  Attached to this bug is a short C program that exercises memcpy() calls on buffers of variable length. It was obtained from a similar bug report for Red Hat, and is publicly available at [0].
  This test program was compiled with gcc 10.2.0, using the following flags:
  $ gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64

  Tests were performed with the following criteria:
  - use 32 MB buffers ("./test_memcpy64 32")
  - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically
  - benchmark with at least 10 runs in the same environment, to minimize variance
  - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other

  Below is a comparison between two Focal containers, leveraging LXD to
  make use of different libc versions on the same host:

  $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32'
  Benchmark #1: libc-2.31-0ubuntu9.2
    Time (mean ± σ):      2.723 s ±  0.013 s    [User: 4.7 ms, System: 5.1 ms]
    Range (min … max):    2.693 s …  2.735 s    10 runs

  Benchmark #2: libc-patched
    Time (mean ± σ):      1.522 s ±  0.004 s    [User: 3.9 ms, System: 5.6 ms]
    Range (min … max):    1.515 s …  1.528 s    10 runs

  Summary
    'libc-patched' ran
      1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2'
  $ head -n5 /proc/cpuinfo
  processor       : 0
  vendor_id       : AuthenticAMD
  cpu family      : 23
  model           : 113
  model name      : AMD Ryzen 7 3700X 8-Core Processor

  [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670
  [1] https://github.com/sharkdp/hyperfine/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+subscriptions