[Bug 2030515]
Cvs-commit
2030515 at bugs.launchpad.net
Thu Apr 4 10:36:35 UTC 2024
The release/2.39/master branch has been updated by Arjun Shankar
<arjun at sourceware.org>:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aa4249266e9906c4bc833e4847f4d8feef59504f
commit aa4249266e9906c4bc833e4847f4d8feef59504f
Author: Adhemerval Zanella <adhemerval.zanella at linaro.org>
Date: Thu Feb 8 10:08:38 2024 -0300
x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
The REP MOVSB usage on memcpy/memmove does not show much performance
improvement on Zen3/Zen4 cores compared to the vectorized loops. Also,
as from BZ 30994, if the source is aligned and the destination is not
the performance can be 20x slower.
The performance difference is noticeable with small buffer sizes, closer
to the lower bounds limits when memcpy/memmove starts to use ERMS. The
performance of REP MOVSB is similar to vectorized instruction on the
size limit (the L2 cache). Also, there is no drawback to multiple cores
sharing the cache.
Checked on x86_64-linux-gnu on Zen3.
Reviewed-by: H.J. Lu <hjl.tools at gmail.com>
(cherry picked from commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e)
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/2030515
Title:
Terrible memcpy performance on Zen 3 when using rep movsb
Status in GLibC:
New
Status in glibc package in Ubuntu:
New
Bug description:
On CPUs that advertise FSRM (fast short rep movsb), glibc 2.35 uses
REP MOVSB for memcpy for sizes above 2112 (up to some threshold that
depends on the cache size). Unfortunately, it seems that Zen 3 (at
least in the microcode we're running) is extremely slow at REP MOVSB
when the data are not well-aligned.
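Whether a given machine advertises the relevant features can be checked from user space. The following is a minimal sketch, not taken from glibc or from this report, assuming a GCC- or Clang-compatible compiler that ships `<cpuid.h>`. The helper names (`leaf7_bit`, `cpu_has_erms`, `cpu_has_fsrm`, `print_rep_movsb_features`) are hypothetical. It reads CPUID leaf 7, subleaf 0, where ERMS is reported in EBX bit 9 and FSRM in EDX bit 4.

```c
#include <assert.h>
#include <stdio.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns 1 if the given bit of CPUID leaf 7
   (subleaf 0) is set in EBX (reg == 'b') or EDX (reg == 'd'), 0 if it
   is clear, and -1 if leaf 7 is unavailable or this is not an x86 build. */
static int leaf7_bit(char reg, unsigned bit)
{
    assert(bit < 32 && (reg == 'b' || reg == 'd'));
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    unsigned value = (reg == 'b') ? ebx : edx;
    return (int)((value >> bit) & 1u);
#else
    return -1;
#endif
}

/* ERMS (enhanced REP MOVSB/STOSB): CPUID.(EAX=7,ECX=0):EBX bit 9. */
int cpu_has_erms(void) { return leaf7_bit('b', 9); }

/* FSRM (fast short REP MOVSB): CPUID.(EAX=7,ECX=0):EDX bit 4. */
int cpu_has_fsrm(void) { return leaf7_bit('d', 4); }

/* Prints both flags; -1 means the information could not be read. */
void print_rep_movsb_features(void)
{
    printf("ERMS: %d\nFSRM: %d\n", cpu_has_erms(), cpu_has_fsrm());
}
```

Calling `print_rep_movsb_features()` from a small `main` shows what the CPU advertises. It does not show which memcpy variant glibc actually selects, since that also depends on the glibc version and its tunables.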
I've found this using a memcpy benchmark at
https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp
(compiled with the adjacent Makefile). To demonstrate the issue, run
./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0
This runs:
- 2113-byte memory copies
- 1,000,000 times per timing measurement
- in memory allocated with mmap
- with the source 0 bytes from the start of the page
- with the destination 1 byte from the start of the page
- on core 0.
It reports about 3.2 GB/s. Change the -b argument to 2111 and it
reports over 100 GB/s. So the REP MOVSB case is about 30× slower!
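The measurement described above can be approximated without the full benchmark. The sketch below is not the linked memcpy_loop.cpp. It is a simplified stand-in, assuming Linux and a GCC- or Clang-compatible compiler, and the function name `copy_rate_gbps` and the constant `REGION_SLACK` are hypothetical. It maps two anonymous regions with mmap, copies `size` bytes `passes` times at the requested offsets from the start of each mapping, and returns the throughput in GB/s. It does not pin the thread to a core.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define REGION_SLACK 4096  /* assumed page size; leaves room for the offsets */

/* Hypothetical helper: returns the memcpy throughput in GB/s, or a
   negative value if mapping or timing fails. */
double copy_rate_gbps(size_t size, size_t passes, size_t src_off, size_t dst_off)
{
    assert(size > 0 && passes > 0);
    assert(src_off < REGION_SLACK && dst_off < REGION_SLACK);

    size_t len = size + REGION_SLACK;
    char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return -1.0;
    char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) {
        munmap(src, len);
        return -1.0;
    }

    /* Touch both regions so page faults are not part of the timing. */
    memset(src, 1, len);
    memset(dst, 0, len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < passes; i++) {
        memcpy(dst + dst_off, src + src_off, size);
        /* Compiler barrier so the copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(src, len);
    munmap(dst, len);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    if (secs <= 0.0)
        return -1.0;
    return (double)size * (double)passes / secs / 1e9;
}
```

With this helper, `copy_rate_gbps(2113, 1000000, 0, 1)` and `copy_rate_gbps(2111, 1000000, 0, 1)` correspond to the two cases compared above. The absolute numbers will depend on the CPU, microcode, and glibc version.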
This will most likely need to be reported and fixed upstream, but I'm
reporting it to Ubuntu first since I don't know if Ubuntu has modified
glibc in any way that would be significant.
See also: https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: libc6 2.35-0ubuntu3.1
ProcVersionSignature: Ubuntu 5.19.0-46.47~22.04.1-generic 5.19.17
Uname: Linux 5.19.0-46-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
Date: Mon Aug 7 14:02:28 2023
RebootRequiredPkgs: Error: path contained symlinks.
SourcePackage: glibc
UpgradeStatus: No upgrade log present (probably fresh install)
To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/2030515/+subscriptions