[SRU][N][O][PATCH 0/1] By always inlining _compound_head(), clone() sees 3%+ performance increase

Fri Nov 22 01:06:35 UTC 2024

BugLink: https://bugs.launchpad.net/bugs/2089327

[Impact]

_compound_head() is called so frequently during clone() heavy workloads with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y set that always inlining it yields a
3%+ performance improvement during clone().

Over the lifecycles of Noble and Oracular, this could save a significant
amount of CPU time during clone(), and with it a large amount of electricity.
We should always inline _compound_head() and take advantage of the
performance boost.

[Fix]

This was fixed in 6.12-rc1 by:

commit ef5f379de302884b9b7ad9b62587a942a9f0bb55
Author: David Hildenbrand <david at redhat.com>
Date:  Tue Aug 20 14:22:10 2024 +0200
Subject: mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef5f379de302884b9b7ad9b62587a942a9f0bb55

This commit is intended to offset the performance loss caused by:

 c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")

which landed in 6.10-rc1, but the change is generic enough that Noble users
would benefit from the fix as well. It brings a 3%+ improvement to both Noble
and Oracular.

[Testcase]

clone() heavy workloads show the performance increase best.

The user who originally requested this runs an Ansible heavy workload, and
finds that clone() becomes a bottleneck during large Ansible runs against
thousands of containers and hosts.

They benchmarked the stock 6.8.0-49-generic kernel against a patched test
kernel built from the same 6.8.0-49-generic version and found:

Before:
    08:24:23: Rename subiquity netplan config
    08:36:12: hostendpoint_monitoring: Create log directory (10990)
    = 11m49s

    08:37:59: Rename subiquity netplan config
    08:49:49: hostendpoint_monitoring: Create log directory (10991)
    = 11m50s

After:
    08:55:16: Rename subiquity netplan config
    09:06:28: hostendpoint_monitoring: Create log directory (10991)
    = 11m12s

    09:08:59: Rename subiquity netplan config
    09:20:22: hostendpoint_monitoring: Create log directory (10991)
    = 11m23s

Comparing the slowest patched run (11m23s) with the fastest unpatched run
(11m49s) gives a conservative 3.6%+ performance improvement. This adds up
over thousands of hosts.
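
For anyone who wants to re-check the arithmetic, here is a minimal Python
sketch that derives the run times and the conservative percentage from the
timestamps quoted above. The helper names (to_seconds, elapsed,
improvement_pct) are made up for illustration, and it assumes each run starts
and ends on the same day.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' wall-clock timestamp to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def elapsed(start: str, end: str) -> int:
    """Elapsed seconds between two timestamps (assumes the same day)."""
    return to_seconds(end) - to_seconds(start)


def improvement_pct(before_s: float, after_s: float) -> float:
    """Percentage reduction in run time going from before_s to after_s."""
    return (before_s - after_s) / before_s * 100.0


# Start/end timestamps copied from the Ansible runs above.
before_runs = [("08:24:23", "08:36:12"), ("08:37:59", "08:49:49")]
after_runs = [("08:55:16", "09:06:28"), ("09:08:59", "09:20:22")]

before = [elapsed(s, e) for s, e in before_runs]  # seconds per unpatched run
after = [elapsed(s, e) for s, e in after_runs]    # seconds per patched run

# Conservative comparison: fastest unpatched run vs slowest patched run.
conservative = improvement_pct(min(before), max(after))
print(f"before={before} after={after} conservative={conservative:.2f}%")
```

This is the most pessimistic pairing of the four runs; any other pairing
gives a larger figure.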

I did some basic tests with stress-ng using the clone() stressor.

I ran:

$ sudo apt install stress-ng
$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics

Before:
ubuntu@jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
stress-ng: info:  [953] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
stress-ng: info:  [953]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
stress-ng: info:  [953] clone             19919     61.80      2.19    232.84       322.29          84.75        76.06
stress-ng: info:  [55777] clone             19540     61.17      1.75    229.32       319.42          84.56        75.55
stress-ng: info:  [107873] clone             19817     62.39      1.92    235.90       317.64          83.33        76.24
stress-ng: info:  [177572] clone             19763     60.57      0.89    226.55       326.27          86.89        75.10

After:
ubuntu@jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
stress-ng: info:  [914] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
stress-ng: info:  [914]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
stress-ng: info:  [914] clone             19446     60.67      1.83    229.60       320.50          84.03        76.29
stress-ng: info:  [67984] clone             19600     60.63      0.90    226.66       323.26          86.13        75.06
stress-ng: info:  [117843] clone             19665     60.64      0.98    226.97       324.27          86.27        75.18
stress-ng: info:  [167831] clone             19306     61.22      1.20    227.39       315.38          84.46        74.68

These numbers are noisier, but it's about 3% extra bogo ops.
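
Since individual stress-ng runs vary, it can help to average a column across
runs rather than compare single rows. Below is a minimal Python sketch for
doing that. It assumes the --metrics row layout shown above (seven numeric
columns after the stressor name); the field names and the helpers
parse_metrics_line and mean_of are made up for illustration and are not part
of stress-ng.

```python
from statistics import mean
from typing import Dict, List, Optional

# Column labels chosen for this sketch, in the order stress-ng prints them.
FIELDS = ("bogo_ops", "real_s", "usr_s", "sys_s",
          "ops_per_s_real", "ops_per_s_cpu", "cpu_pct_per_instance")


def parse_metrics_line(line: str, stressor: str = "clone") -> Optional[Dict[str, float]]:
    """Parse one 'stress-ng: info: [pid] <stressor> ...' metrics row.

    Returns None for header rows and for rows of other stressors.
    """
    tokens = line.split()
    if stressor not in tokens:
        return None
    values = tokens[tokens.index(stressor) + 1:]
    if len(values) != len(FIELDS):
        return None
    try:
        return dict(zip(FIELDS, (float(v) for v in values)))
    except ValueError:
        return None


def mean_of(lines: List[str], field: str) -> float:
    """Average one column over every parsable metrics row in lines."""
    rows = [row for row in map(parse_metrics_line, lines) if row is not None]
    return mean(row[field] for row in rows)


# One header row and one data row copied from the "Before" output above.
sample = [
    "stress-ng: info:  [953] stressor       bogo ops real time  usr time",
    "stress-ng: info:  [953] clone             19919     61.80      2.19"
    "    232.84       322.29          84.75        76.06",
]
print(mean_of(sample, "ops_per_s_real"))
```

Feeding the full "Before" and "After" outputs to mean_of() with the same
field name gives one figure per kernel to compare.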

A test kernel is available in the following PPA:

https://launchpad.net/~mruffell/+archive/ubuntu/sf401086-test

Installing it should give a 3%+ performance improvement on clone() heavy
workloads.

[Where problems could occur]

We are inlining a hot function in the clone() syscall call path. This should
improve performance by avoiding the function call overhead on each call to
_compound_head(), with little downside apart from a slightly larger binary
and the inability to livepatch the function.

I checked with cscope, and _compound_head() is called from:

compound_head()
page_folio()

both of which are #defines in page-flags.h. The change in footprint is going
to be minuscule.

The risk of regression is well worth the 3%+ performance gain.

David Hildenbrand (1):
  mm: always inline _compound_head() with
    CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

 include/linux/page-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.45.2