[Bug 2030515]

Thu Apr 4 10:36:35 UTC 2024

The release/2.39/master branch has been updated by Arjun Shankar
<arjun at sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aa4249266e9906c4bc833e4847f4d8feef59504f

commit aa4249266e9906c4bc833e4847f4d8feef59504f
Author: Adhemerval Zanella <adhemerval.zanella at linaro.org>
Date:   Thu Feb 8 10:08:38 2024 -0300

    x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)

    The REP MOVSB usage on memcpy/memmove does not show much performance
    improvement on Zen3/Zen4 cores compared to the vectorized loops.  Also,
    as from BZ 30994, if the source is aligned and the destination is not
    the performance can be 20x slower.

    The performance difference is noticeable with small buffer sizes, closer
    to the lower bounds limits when memcpy/memmove starts to use ERMS.  The
    performance of REP MOVSB is similar to vectorized instruction on the
    size limit (the L2 cache).  Also, there is no drawback to multiple cores
    sharing the cache.

    Checked on x86_64-linux-gnu on Zen3.
    Reviewed-by: H.J. Lu <hjl.tools at gmail.com>

    (cherry picked from commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e)

https://bugs.launchpad.net/bugs/2030515

Title:
  Terrible memcpy performance on Zen 3 when using rep movsb

Status in GLibC:
  New
Status in glibc package in Ubuntu:
  New

Bug description:
  On CPUs that advertise FSRM (fast short rep movsb), glibc 2.35 uses
  REP MOVSB for memcpy for sizes above 2112 bytes (up to some threshold
  that depends on the cache size). Unfortunately, it seems that Zen 3 (at
  least with the microcode we're running) is extremely slow at REP MOVSB
  when the data are not well-aligned.
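
  To check whether a machine advertises the features mentioned above,
  the CPUID bits can be read directly. The following is only a sketch,
  not glibc's own detection code: the function names are made up for
  illustration, it assumes GCC or Clang (for <cpuid.h>), and it relies
  on ERMS being reported in CPUID leaf 7, subleaf 0, EBX bit 9 and FSRM
  in EDX bit 4 of the same leaf.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: read CPUID leaf 7, subleaf 0.
   Returns 1 on success, 0 if the leaf is unavailable or the target
   is not x86. */
static int read_leaf7(unsigned int *ebx, unsigned int *edx)
{
    assert(ebx && edx);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ecx = 0;
    return __get_cpuid_count(7, 0, &eax, ebx, &ecx, edx);
#else
    return 0;
#endif
}

/* Each function returns 1 if the feature bit is set, 0 if it is clear,
   and -1 if the leaf could not be queried. */
int cpu_has_erms(void)
{
    unsigned int ebx = 0, edx = 0;
    if (!read_leaf7(&ebx, &edx))
        return -1;
    return (int)((ebx >> 9) & 1u);   /* ERMS: EBX bit 9 */
}

int cpu_has_fsrm(void)
{
    unsigned int ebx = 0, edx = 0;
    if (!read_leaf7(&ebx, &edx))
        return -1;
    return (int)((edx >> 4) & 1u);   /* FSRM: EDX bit 4 */
}
```

  On Linux the same information is also exposed as the "erms" and
  "fsrm" flags in /proc/cpuinfo.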

  I've found this using a memcpy benchmark at
  https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp
  (compiled with the adjacent Makefile). To demonstrate the issue, run

  ./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0

  This runs:
  - 2113-byte memory copies
  - 1,000,000 times per timing measurement
  - in memory allocated with mmap
  - with the source 0 bytes from the start of the page
  - with the destination 1 byte from the start of the page
  - on core 0.

  It reports about 3.2 GB/s. Change the -b argument to 2111 and it
  reports over 100 GB/s. So the REP MOVSB case is about 30× slower!
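
  For readers without the benchmark at hand, the measurement described
  above can be approximated with a much smaller program. This is a
  rough sketch, not the memcpy_loop.cpp source: the function name and
  its parameters are invented here, it does not pin to a core (running
  it under "taskset -c 0" would be one way to do that), and the empty
  asm statement used as a compiler barrier assumes GCC or Clang.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: copy `size` bytes `passes` times between two
   mmap'd buffers, with the source and destination starting `src_off`
   and `dst_off` bytes into their first page.
   Returns throughput in GB/s, or -1.0 on error. */
double copy_throughput_gbps(size_t size, long passes,
                            size_t src_off, size_t dst_off)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && passes > 0 && size > 0);
    assert(src_off < (size_t)page && dst_off < (size_t)page);

    /* Room for one page of offset plus the copy, rounded up to pages. */
    size_t len = (size + 2 * (size_t)page - 1) / (size_t)page * (size_t)page;

    char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED) {
        if (src != MAP_FAILED) munmap(src, len);
        if (dst != MAP_FAILED) munmap(dst, len);
        return -1.0;
    }

    /* Touch every page so page faults are not part of the timing. */
    memset(src, 1, len);
    memset(dst, 0, len);

    /* Reading the size through a volatile keeps the compiler from
       expanding memcpy inline for a known constant size. */
    volatile size_t n = size;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < passes; i++) {
        memcpy(dst + dst_off, src + src_off, n);
        /* Compiler barrier: the copy must not be elided or hoisted. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(src, len);
    munmap(dst, len);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;
    return (double)size * (double)passes / secs / 1e9;
}
```

  Calling copy_throughput_gbps(2113, 1000000, 0, 1) and then
  copy_throughput_gbps(2111, 1000000, 0, 1) gives the two figures to
  compare; the absolute numbers will depend on the CPU, microcode and
  glibc version.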

  This will most likely need to be reported and fixed upstream, but I'm
  reporting it to Ubuntu first since I don't know if Ubuntu has modified
  glibc in any way that would be significant.

  See also: https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/

  ProblemType: Bug
  DistroRelease: Ubuntu 22.04
  Package: libc6 2.35-0ubuntu3.1
  ProcVersionSignature: Ubuntu 5.19.0-46.47~22.04.1-generic 5.19.17
  Uname: Linux 5.19.0-46-generic x86_64
  NonfreeKernelModules: nvidia_modeset nvidia
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  CasperMD5CheckResult: unknown
  Date: Mon Aug  7 14:02:28 2023
  RebootRequiredPkgs: Error: path contained symlinks.
  SourcePackage: glibc
  UpgradeStatus: No upgrade log present (probably fresh install)
