[PULL][Noble] backport arm64 THP improvements from 6.9

dann frazier dann.frazier at canonical.com
Fri Mar 29 12:46:36 UTC 2024


On Fri, Mar 29, 2024 at 09:50:10AM +0100, Andrea Righi wrote:
> On Thu, Mar 28, 2024 at 06:25:45AM -0600, dann frazier wrote:
> > BugLink: https://bugs.launchpad.net/bugs/2059316
> > 
> > These are all clean cherry-picks from upstream, save one that required
> > minor backporting due to an API change. This does touch generic code.
> > I've regression tested as much as time has allowed, given that I only
> > learned about this patchset last week. But with the performance
> > improvements I've measured, and the GPU workload improvements reported
> > upstream, I think this is something we need to seriously consider
> > for our next LTS.
> > 
> > I plan to monitor upstream for Fixes commits during the 6.9
> > development cycle, in case regressions are found/fixed.
> 
> The PR itself looks good and it'd be a nice improvement to apply to the
> noble kernel.
> 
> However, this is not a small change, and it's definitely not
> self-contained, since it touches various parts of mm, especially for arm64.

Agreed.

> My concern is that something like this may add a significant burden when
> applying stable updates, having risk of conflicts and such, that could
> significantly slow down the release of security fixes.

Yes, that is certainly a valid concern. On the plus side, we're just
moving this code forward - so I'd expect 6.9-targeted backports to apply
cleanly in most cases. But I have hit cases where a security patch
touches both the backported code and the non-backported code,
preventing us from using either the 6.8-stable or the 6.9-stable
backport directly, so I hear you.

I will note that this is planned to be requested for the -nvidia
kernel. With 10x performance benefits, I expect that to be pushed
pretty hard. If accepted, we'd end up needing to do that security
maintenance work anyway. -nvidia users will be using it in anger for
x86 and ARM, so I'm not sure how much "blast radius" we would avoid by
keeping it there. And by having it in the base kernel, of course,
non-NVIDIA ARM systems like Ampere and our cloud kernels will also
benefit.
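
(For reference, a quick way to see whether the multi-size THP support
these patches enable is active on a booted kernel - the sysfs paths
below follow the mTHP ABI introduced in 6.8; the script itself is just
an illustrative sketch, not part of this series:)

```shell
#!/bin/sh
# List each mTHP size exposed by the kernel and its current policy
# (e.g. "always", "inherit", "madvise", "never").
THP_DIR=/sys/kernel/mm/transparent_hugepage

if [ -d "$THP_DIR" ]; then
    for d in "$THP_DIR"/hugepages-*kB; do
        # The glob stays literal if no mTHP sizes are exposed, so
        # guard on the control file actually existing.
        if [ -e "$d/enabled" ]; then
            printf '%s: %s\n' "${d##*/}" "$(cat "$d/enabled")"
        fi
    done
else
    echo "transparent_hugepage sysfs interface not present"
fi
true  # always exit 0; this is an informational check
```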

> Do you think it's reasonable to ping some upstream people and see if
> they're willing to apply this to the stable branch for 6.8? I understand
> it's not really a fix, but given the benefits it might be a reasonable
> change to backport.

IMO, it does not meet the criteria for stable. We could argue that it
is a notable performance *issue* - but given it isn't addressing a
regression, I feel that would be a stretch. This feels to me more like
a feature.

  -dann

> > 
> > The following changes since commit 4427c45609f6faf3b5b15e9e9246caa87894c36f:
> > 
> >   UBUNTU: Ubuntu-6.8.0-20.20 (2024-03-18 11:08:14 +0100)
> > 
> > are available in the Git repository at:
> > 
> >   git://git.launchpad.net/~dannf/ubuntu/+source/linux/+git/linux noble-mthp
> > 
> > for you to fetch changes up to 84cfdc5c9342767b6abc872a4baad5a50a58d4f0:
> > 
> >   arm64/mm: improve comment in contpte_ptep_get_lockless() (2024-03-27 15:59:18 -0600)
> > 
> > ----------------------------------------------------------------
> > David Hildenbrand (14):
> >       arm/pgtable: define PFN_PTE_SHIFT
> >       nios2/pgtable: define PFN_PTE_SHIFT
> >       powerpc/pgtable: define PFN_PTE_SHIFT
> >       riscv/pgtable: define PFN_PTE_SHIFT
> >       s390/pgtable: define PFN_PTE_SHIFT
> >       sparc/pgtable: define PFN_PTE_SHIFT
> >       mm/pgtable: make pte_next_pfn() independent of set_ptes()
> >       arm/mm: use pte_next_pfn() in set_ptes()
> >       powerpc/mm: use pte_next_pfn() in set_ptes()
> >       mm/memory: factor out copying the actual PTE in copy_present_pte()
> >       mm/memory: pass PTE to copy_present_pte()
> >       mm/memory: optimize fork() with PTE-mapped THP
> >       mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
> >       mm/memory: ignore writable bit in folio_pte_batch()
> > 
> > Ryan Roberts (21):
> >       arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary
> >       mm: clarify the spec for set_ptes()
> >       mm: thp: batch-collapse PMD with set_ptes()
> >       mm: introduce pte_advance_pfn() and use for pte_next_pfn()
> >       arm64/mm: convert pte_next_pfn() to pte_advance_pfn()
> >       x86/mm: convert pte_next_pfn() to pte_advance_pfn()
> >       mm: tidy up pte_next_pfn() definition
> >       arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep)
> >       arm64/mm: convert set_pte_at() to set_ptes(..., 1)
> >       arm64/mm: convert ptep_clear() to ptep_get_and_clear()
> >       arm64/mm: new ptep layer to manage contig bit
> >       arm64/mm: split __flush_tlb_range() to elide trailing DSB
> >       arm64/mm: wire up PTE_CONT for user mappings
> >       arm64/mm: implement new wrprotect_ptes() batch API
> >       arm64/mm: implement new [get_and_]clear_full_ptes() batch APIs
> >       mm: add pte_batch_hint() to reduce scanning in folio_pte_batch()
> >       arm64/mm: implement pte_batch_hint()
> >       arm64/mm: __always_inline to improve fork() perf
> >       arm64/mm: automatically fold contpte mappings
> >       arm64/mm: export contpte symbols only to GPL users
> >       arm64/mm: improve comment in contpte_ptep_get_lockless()
> > 
> > dann frazier (1):
> >       UBUNTU: [Config] arm64: ARM64_CONTPTE=y
> > 
> >  arch/arm/include/asm/pgtable.h      |   2 +
> >  arch/arm/mm/mmu.c                   |   2 +-
> >  arch/arm64/Kconfig                  |   9 +
> >  arch/arm64/include/asm/pgtable.h    | 431 +++++++++++++++++++++++++++++++-----
> >  arch/arm64/include/asm/tlbflush.h   |  13 +-
> >  arch/arm64/kernel/efi.c             |   4 +-
> >  arch/arm64/kernel/mte.c             |   2 +-
> >  arch/arm64/kvm/guest.c              |   2 +-
> >  arch/arm64/mm/Makefile              |   1 +
> >  arch/arm64/mm/contpte.c             | 408 ++++++++++++++++++++++++++++++++++
> >  arch/arm64/mm/fault.c               |  12 +-
> >  arch/arm64/mm/fixmap.c              |   4 +-
> >  arch/arm64/mm/hugetlbpage.c         |  40 ++--
> >  arch/arm64/mm/kasan_init.c          |   6 +-
> >  arch/arm64/mm/mmu.c                 |  16 +-
> >  arch/arm64/mm/pageattr.c            |   6 +-
> >  arch/arm64/mm/trans_pgd.c           |   6 +-
> >  arch/nios2/include/asm/pgtable.h    |   2 +
> >  arch/powerpc/include/asm/pgtable.h  |   2 +
> >  arch/powerpc/mm/pgtable.c           |   5 +-
> >  arch/riscv/include/asm/pgtable.h    |   2 +
> >  arch/s390/include/asm/pgtable.h     |   2 +
> >  arch/sparc/include/asm/pgtable_64.h |   2 +
> >  arch/x86/include/asm/pgtable.h      |   8 +-
> >  debian.master/config/annotations    |   3 +
> >  include/linux/efi.h                 |   5 +
> >  include/linux/pgtable.h             |  65 +++++-
> >  mm/huge_memory.c                    |  58 ++---
> >  mm/memory.c                         | 219 ++++++++++++++----
> >  29 files changed, 1151 insertions(+), 186 deletions(-)
> >  create mode 100644 arch/arm64/mm/contpte.c
> > 
