APPLIED: [SRU][N][O][PATCH 0/1] By always inlining _compound_head(), clone() sees 3%+ performance increase
Roxana Nicolescu
roxana.nicolescu at canonical.com
Thu Nov 28 14:26:44 UTC 2024
On 22/11/2024 02:06, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2089327
>
> [Impact]
>
> _compound_head() is called frequently during clone()-heavy workloads with
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y set, often enough that it is worthwhile
> always inlining it for a 3%+ performance improvement during clone().
>
> Over the lifecycle of Noble and Oracular, it could save a significant amount
> of CPU time during clone(), and a large amount of electricity. We should
> always inline _compound_head() and take advantage of the performance boost.
>
> [Fix]
>
> This was fixed in 6.12-rc1 by:
>
> commit ef5f379de302884b9b7ad9b62587a942a9f0bb55
> Author: David Hildenbrand <david at redhat.com>
> Date: Tue Aug 20 14:22:10 2024 +0200
> Subject: mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef5f379de302884b9b7ad9b62587a942a9f0bb55
>
> This commit is intended to offset the performance loss caused by:
>
> c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")
>
> which landed in 6.10-rc1, but the change is generic enough that Noble users
> would benefit from the fix as well. It gives both Noble and Oracular a 3%+
> improvement.
>
> [Testcase]
>
> clone()-heavy workloads show the performance increase best.
>
> Originally, the user who requested this was running an Ansible-heavy workload
> and found that clone() becomes a bottleneck during large Ansible runs against
> thousands of containers and hosts.
>
> They benchmarked 6.8.0-49-generic against a patched test kernel of the same
> 6.8.0-49-generic and found:
>
> Before:
> 08:24:23: Rename subiquity netplan config
> 08:36:12: hostendpoint_monitoring: Create log directory (10990)
> = 11m49s
>
> 08:37:59: Rename subiquity netplan config
> 08:49:49: hostendpoint_monitoring: Create log directory (10991)
> = 11m50s
>
> After:
> 08:55:16: Rename subiquity netplan config
> 09:06:28: hostendpoint_monitoring: Create log directory (10991)
> = 11m12s
>
> 09:08:59: Rename subiquity netplan config
> 09:20:22: hostendpoint_monitoring: Create log directory (10991)
> = 11m23s
>
> Taking the slowest patched run (11m23s) against the fastest unpatched run
> (11m49s) gives a 3.6%+ performance improvement. This adds up over thousands
> of hosts.
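
The percentage above can be reproduced from the quoted wall-clock durations. The following is a minimal sketch, not part of the original submission; the helper names `to_seconds` and `improvement_pct` are hypothetical and exist only for this illustration.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration such as '11m49s' into seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def improvement_pct(before: str, after: str) -> float:
    """Relative reduction in elapsed time, as a percentage of 'before'."""
    b, a = to_seconds(before), to_seconds(after)
    return (b - a) / b * 100.0


# Conservative pairing: fastest unpatched run vs slowest patched run.
conservative = improvement_pct("11m49s", "11m23s")
# Most favourable pairing: slowest unpatched run vs fastest patched run.
best_case = improvement_pct("11m50s", "11m12s")

print(f"conservative: {conservative:.2f}%  best case: {best_case:.2f}%")
```

With the four durations quoted above, the conservative pairing works out to roughly 3.7% and the most favourable pairing to roughly 5.4%.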
>
> I did some basic tests with stress-ng using the clone() stressor.
>
> I ran:
>
> $ sudo apt install stress-ng
> $ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
>
> Before:
> ubuntu at jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
> stress-ng: info: [953]     stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per
> stress-ng: info: [953]                            (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)
> stress-ng: info: [953]     clone        19919      61.80      2.19    232.84       322.29           84.75         76.06
> stress-ng: info: [55777]   clone        19540      61.17      1.75    229.32       319.42           84.56         75.55
> stress-ng: info: [107873]  clone        19817      62.39      1.92    235.90       317.64           83.33         76.24
> stress-ng: info: [177572]  clone        19763      60.57      0.89    226.55       326.27           86.89         75.10
>
> After:
> ubuntu at jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
> stress-ng: info: [914]     stressor  bogo ops  real time  usr time  sys time   bogo ops/s      bogo ops/s  CPU used per
> stress-ng: info: [914]                            (secs)    (secs)    (secs)  (real time)  (usr+sys time)  instance (%)
> stress-ng: info: [914]     clone        19446      60.67      1.83    229.60       320.50           84.03         76.29
> stress-ng: info: [67984]   clone        19600      60.63      0.90    226.66       323.26           86.13         75.06
> stress-ng: info: [117843]  clone        19665      60.64      0.98    226.97       324.27           86.27         75.18
> stress-ng: info: [167831]  clone        19306      61.22      1.20    227.39       315.38           84.46         74.68
>
> These numbers are noisier, but it's about 3% extra bogo ops.
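
One way to compare the two stress-ng runs is to average each bogo ops/s column over the four samples per kernel. The sketch below does that using only the figures quoted above; the dictionary keys and the `pct_change` helper are hypothetical names for this illustration. With four samples per kernel, the spread between samples is comparable to the difference between kernels, so the result is best read as a rough indication.

```python
from statistics import mean

# bogo ops/s columns copied from the stress-ng output quoted above.
before = {
    "real_time": [322.29, 319.42, 317.64, 326.27],
    "usr_sys_time": [84.75, 84.56, 83.33, 86.89],
}
after = {
    "real_time": [320.50, 323.26, 324.27, 315.38],
    "usr_sys_time": [84.03, 86.13, 86.27, 84.46],
}


def pct_change(old: list[float], new: list[float]) -> float:
    """Percentage change of the mean of 'new' relative to the mean of 'old'."""
    return (mean(new) - mean(old)) / mean(old) * 100.0


for column in before:
    print(
        f"{column}: before={mean(before[column]):.2f} "
        f"after={mean(after[column]):.2f} "
        f"change={pct_change(before[column], after[column]):+.2f}%"
    )
```

Longer runs or more repetitions would narrow the spread and make the comparison more reliable.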
>
> There is a test kernel available in the below ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf401086-test
>
> If you install it, you too will get a 3%+ performance improvement on
> clone()-heavy workloads.
>
> [Where problems could occur]
>
> We are inlining a hot function in the clone() syscall call path. This
> should improve performance by avoiding the function call overhead on each
> call to _compound_head(), without much of a downside apart from slightly
> increased binary size and the inability to livepatch the function.
>
> I checked with cscope, and _compound_head() is called from:
>
> compound_head()
> page_folio()
>
> Both are #defines in page-flags.h. The footprint change is going to be
> minuscule.
>
> The risk of regression is well worth the 3%+ performance gain.
>
> David Hildenbrand (1):
> mm: always inline _compound_head() with
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>
> include/linux/page-flags.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
Applied to oracular:linux, noble:linux master-next branches. Thanks!