[SRU][N][O][PATCH 0/1] By always inlining _compound_head(), clone() sees 3%+ performance increase
Matthew Ruffell
matthew.ruffell at canonical.com
Fri Nov 22 01:06:35 UTC 2024
BugLink: https://bugs.launchpad.net/bugs/2089327
[Impact]
_compound_head() is called so frequently during clone() heavy workloads with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y set that it is worthwhile always
inlining it, for a 3%+ performance improvement during clone().
Over the lifecycle of Noble and Oracular this could save significant amounts of
CPU time during clone(), and with it a large amount of electricity. We should
always inline _compound_head() and take advantage of the performance boost.
[Fix]
This was fixed in 6.12-rc1 by:
commit ef5f379de302884b9b7ad9b62587a942a9f0bb55
Author: David Hildenbrand <david at redhat.com>
Date: Tue Aug 20 14:22:10 2024 +0200
Subject: mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef5f379de302884b9b7ad9b62587a942a9f0bb55
This commit is intended to offset the performance loss caused by:
c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")
which landed in 6.10-rc1, but the change is generic enough that Noble users
benefit from the fix as well. It brings both Noble and Oracular a 3%+
improvement in clone() performance.
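For context, the fix itself is a one-line attribute change in
include/linux/page-flags.h. The sketch below shows roughly how the function
reads after the change; it is paraphrased from my reading of the upstream
commit rather than quoted verbatim, so refer to the link above for the
authoritative hunk:

/*
 * include/linux/page-flags.h (after the fix) -- the only change is the
 * attribute: "static inline" becomes "static __always_inline", so the
 * compiler can no longer choose to emit _compound_head() out of line on
 * this hot path when CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y makes the
 * body non-trivial.
 */
static __always_inline unsigned long _compound_head(const struct page *page)
{
        unsigned long head = READ_ONCE(page->compound_head);

        if (unlikely(head & 1))
                return head - 1;
        return (unsigned long)page_fixed_fake_head(page);
}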
[Testcase]
clone() heavy workloads best demonstrate the performance increase.
The user who originally requested this runs an Ansible heavy workload, and
finds that clone() becomes a bottleneck during large runs of Ansible against
thousands of containers and hosts.
They benchmarked the stock 6.8.0-49-generic kernel against a patched test
kernel based on the same 6.8.0-49-generic and found:
Before:
08:24:23: Rename subiquity netplan config
08:36:12: hostendpoint_monitoring: Create log directory (10990)
= 11m49s
08:37:59: Rename subiquity netplan config
08:49:49: hostendpoint_monitoring: Create log directory (10991)
= 11m50s
After:
08:55:16: Rename subiquity netplan config
09:06:28: hostendpoint_monitoring: Create log directory (10991)
= 11m12s
09:08:59: Rename subiquity netplan config
09:20:22: hostendpoint_monitoring: Create log directory (10991)
= 11m23s
Comparing 11m23s (683 seconds) against 11m49s (709 seconds) gives a 3.6%+
performance improvement. This adds up over thousands of hosts.
I did some basic tests with stress-ng using the clone() stressor.
I ran:
$ sudo apt install stress-ng
$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
Before:
ubuntu at jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
stress-ng: info: [953]    stressor    bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per
stress-ng: info: [953]                            (secs)     (secs)    (secs)  (real time)  (usr+sys time)  instance (%)
stress-ng: info: [953]    clone          19919      61.80      2.19    232.84       322.29           84.75         76.06
stress-ng: info: [55777]  clone          19540      61.17      1.75    229.32       319.42           84.56         75.55
stress-ng: info: [107873] clone          19817      62.39      1.92    235.90       317.64           83.33         76.24
stress-ng: info: [177572] clone          19763      60.57      0.89    226.55       326.27           86.89         75.10
After:
ubuntu at jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
stress-ng: info: [914]    stressor    bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per
stress-ng: info: [914]                            (secs)     (secs)    (secs)  (real time)  (usr+sys time)  instance (%)
stress-ng: info: [914]    clone          19446      60.67      1.83    229.60       320.50           84.03         76.29
stress-ng: info: [67984]  clone          19600      60.63      0.90    226.66       323.26           86.13         75.06
stress-ng: info: [117843] clone          19665      60.64      0.98    226.97       324.27           86.27         75.18
stress-ng: info: [167831] clone          19306      61.22      1.20    227.39       315.38           84.46         74.68
These numbers are a bit fuzzier, but it's about 3% extra bogo ops.
There is a test kernel available in the below ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf401086-test
If you install it, you too should see a 3%+ performance improvement on clone()
heavy workloads.
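For a quick sanity check that does not depend on stress-ng, a minimal clone()
heavy loop can be timed on both kernels. The program below is a hypothetical
helper written for this description (the file name, build line and iteration
count are all arbitrary), not something taken from the original report:

/*
 * clone_bench.c: spawn and immediately reap short-lived children in a loop
 * so that almost all time is spent in the clone()/fork() path, where
 * _compound_head() is hot with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y.
 *
 * Build: gcc -O2 -o clone_bench clone_bench.c
 * Run:   time ./clone_bench 20000
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long iterations = (argc > 1) ? atol(argv[1]) : 10000;

        for (long i = 0; i < iterations; i++) {
                pid_t pid = fork(); /* glibc implements fork() via clone() */

                if (pid < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid == 0)
                        _exit(0); /* child exits immediately */
                waitpid(pid, NULL, 0); /* reap so zombies do not pile up */
        }
        return 0;
}

Timing it before and after installing the test kernel should show the sys time
drop by a few percent, in line with the stress-ng figures above.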
[Where problems could occur]
We are inlining a heavily used function in the clone() syscall callpath. This
should increase performance by removing the function call overhead on every
call to _compound_head(), without much of a downside, apart from a slightly
increased binary size and the inability to livepatch the function directly.
I checked with cscope, and _compound_head() is called from:
compound_head()
page_folio()
both defined as macros in page-flags.h, so the change in footprint will be
minuscule.
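For reference, both callers are thin wrappers that expand directly to
_compound_head(); the sketch below shows roughly how they read in
include/linux/page-flags.h on recent kernels (paraphrased from my reading of
the header, not quoted verbatim):

#define compound_head(page)     ((typeof(page))_compound_head(page))

#define page_folio(p)           (_Generic((p),                            \
        const struct page *:    (const struct folio *)_compound_head(p),  \
        struct page *:          (struct folio *)_compound_head(p)))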
The risk of regression is well worth the 3%+ performance gain.
David Hildenbrand (1):
mm: always inline _compound_head() with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
include/linux/page-flags.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--
2.45.2