APPLIED: [SRU][N][O][PATCH 0/1] By always inlining _compound_head(), clone() sees 3%+ performance increase

Roxana Nicolescu roxana.nicolescu at canonical.com
Thu Nov 28 14:26:44 UTC 2024


On 22/11/2024 02:06, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2089327
>
> [Impact]
>
> _compound_head() is called frequently during clone() heavy workloads with
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y set, so much that it is worthwhile
> always inlining it for a slight 3%+ performance improvement during clone().
>
> Over the lifecycle of Noble and Oracular, it could save significant amounts of
> CPU time during clone(), and save a large amount of electricity. We should
> always inline _compound_head() and take advantage of the performance boost.
>
> [Fix]
>
> This was fixed in 6.12-rc1 by:
>
> commit ef5f379de302884b9b7ad9b62587a942a9f0bb55
> Author: David Hildenbrand <david at redhat.com>
> Date:  Tue Aug 20 14:22:10 2024 +0200
> Subject: mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef5f379de302884b9b7ad9b62587a942a9f0bb55
>
> This commit is intended to offset the performance loss caused by:
>
>   c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")
>   
> which landed in 6.10-rc1, but the change is generic enough that Noble users
> would benefit from the fix as well. It gives both Noble and Oracular a 3%+ gain.
>
> [Testcase]
>
> clone()-heavy workloads show the performance increase best.
>
> Originally, the user who requested this is running an Ansible heavy workload,
> and finds that clone() bottlenecks during large runs of Ansible against
> thousands of containers and hosts.
>
> They benchmarked 6.8.0-49-generic against a patched test kernel of the same
> 6.8.0-49-generic and found:
>
> Before:
>      08:24:23: Rename subiquity netplan config
>      08:36:12: hostendpoint_monitoring: Create log directory (10990)
>      = 11m49s
>       
>      08:37:59: Rename subiquity netplan config
>      08:49:49: hostendpoint_monitoring: Create log directory (10991)
>      = 11m50s
>      
> After:
>      08:55:16: Rename subiquity netplan config
>      09:06:28: hostendpoint_monitoring: Create log directory (10991)
>      = 11m12s
>       
>      09:08:59: Rename subiquity netplan config
>      09:20:22: hostendpoint_monitoring: Create log directory (10991)
>      = 11m23s
>      
> Take 11m23s versus 11m49s, for a 3.6%+ performance improvement. This adds up
> over thousands of hosts.
>
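The percentage quoted above can be re-derived from the logged timestamps. The following is a minimal Python sketch, not part of the original submission; the helper names `elapsed_seconds` and `improvement_pct` are made up for illustration, and the timestamps are copied from the Before/After runs above.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

def improvement_pct(before_s: float, after_s: float) -> float:
    """Percentage reduction in run time relative to the 'before' run."""
    return (before_s - after_s) / before_s * 100.0

# (start, end) pairs copied from the Ansible runs quoted above.
before_runs = [("08:24:23", "08:36:12"), ("08:37:59", "08:49:49")]
after_runs = [("08:55:16", "09:06:28"), ("09:08:59", "09:20:22")]

before = [elapsed_seconds(s, e) for s, e in before_runs]  # [709, 710]
after = [elapsed_seconds(s, e) for s, e in after_runs]    # [672, 683]

# Conservative comparison: fastest unpatched run against slowest patched run.
conservative = improvement_pct(min(before), max(after))
# Comparison of the mean run time of each pair.
mean_based = improvement_pct(sum(before) / len(before), sum(after) / len(after))

print(f"conservative: {conservative:.2f}%")  # conservative: 3.67%
print(f"mean-based:   {mean_based:.2f}%")    # mean-based:   4.51%
```

The 11m23s versus 11m49s comparison is the conservative pairing, which is why the figure is stated as a lower bound.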
> I did some basic tests with stress-ng using the clone() stressor.
>
> I ran:
>
> $ sudo apt install stress-ng
> $ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
>
> Before:
> ubuntu at jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
> stress-ng: info:  [953] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
> stress-ng: info:  [953]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
> stress-ng: info:  [953] clone             19919     61.80      2.19    232.84       322.29          84.75        76.06
> stress-ng: info:  [55777] clone             19540     61.17      1.75    229.32       319.42          84.56        75.55
> stress-ng: info:  [107873] clone             19817     62.39      1.92    235.90       317.64          83.33        76.24
> stress-ng: info:  [177572] clone             19763     60.57      0.89    226.55       326.27          86.89        75.10
>
> After:
> ubuntu at jammy-test:~$ sudo stress-ng --seq=5 --clone 5 --timeout=60 --metrics
> stress-ng: info:  [914] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
> stress-ng: info:  [914]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
> stress-ng: info:  [914] clone             19446     60.67      1.83    229.60       320.50          84.03        76.29
> stress-ng: info:  [67984] clone             19600     60.63      0.90    226.66       323.26          86.13        75.06
> stress-ng: info:  [117843] clone             19665     60.64      0.98    226.97       324.27          86.27        75.18
> stress-ng: info:  [167831] clone             19306     61.22      1.20    227.39       315.38          84.46        74.68
>
> These numbers are a bit fuzzier, but it's about 3% extra bogo ops.
>
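Because the individual stress-ng runs are noisy, one way to compare them is to average a metric over all four runs on each side. The following is a rough Python sketch under that assumption, not something from the original submission; it uses the "bogo ops/s (usr+sys time)" column copied from the tables above, and the helper name `pct_change` is hypothetical.

```python
from statistics import mean

# "bogo ops/s (usr+sys time)" column, copied from the four runs on each side.
before_ops = [84.75, 84.56, 83.33, 86.89]
after_ops = [84.03, 86.13, 86.27, 84.46]

def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0

before_mean = mean(before_ops)
after_mean = mean(after_ops)

print(f"before mean: {before_mean:.2f} bogo ops/s")
print(f"after mean:  {after_mean:.2f} bogo ops/s")
print(f"change:      {pct_change(before_mean, after_mean):+.2f}%")
```

With only four 60-second runs per kernel, the run-to-run spread is comparable to the effect being measured, so longer or more numerous runs would give a firmer figure.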
> There is a test kernel available in the below ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf401086-test
>
> If you install it, you too will get 3%+ performance improvement on clone() heavy
> workloads.
>
> [Where problems could occur]
>
> We are inlining a hot function in the clone() syscall call path. This
> should increase performance by removing the function-call overhead on every
> use of _compound_head(), without much of a downside apart from a
> slightly increased binary size and the inability to livepatch the function.
>
> I checked on cscope, and _compound_head is called from:
>
> compound_head()
> page_folio()
>
> both in page-flags.h as #defines. This is going to have a minuscule footprint
> change.
>
> The risk of regression is well worth the 3%+ performance gain.
>
> David Hildenbrand (1):
>    mm: always inline _compound_head() with
>      CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>
>   include/linux/page-flags.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
Applied to oracular:linux, noble:linux master-next branches. Thanks!


