ACK: [SRU][N][PATCH 0/1] Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test (LP: 2076147)
Stefan Bader
stefan.bader at canonical.com
Thu Sep 26 15:29:04 UTC 2024
On 05.09.24 09:48, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2076147
>
> SRU Justification:
>
> * A KVM 2nd level guest (i.e. a KVM VM that runs nested on top of a Power 10
> PowerVM hypervisor) hangs during the LTP (Linux Test Project) test suite.
>
> * It hangs with:
> "Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab"
>
> * Diagnosing the issue points to this fix/upstream commit:
> [commit message, by Barry Song <v-songbaohua at oppo.com>]
> Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
> modifications preceded by pte clear. While iterating over PTEs of a large folio,
> it only starts acquiring PTL from the first valid (present) PTE.
> PTE modifications can temporarily set PTEs to pte_none.
> Consequently, the initial PTEs of a large folio might be skipped
> in try_to_unmap_one().
> For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
> still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
> try_to_unmap_one().
> So the folio will still be mapped, it fails to be reclaimed and is put
> back on the LRU in this round.
> This also breaks up PTE optimizations such as CONT-PTE on this large folio
> and may lead to an accidental folio_split() afterwards.
> And since a part of PTEs are now swap entries, accessing those parts will
> introduce overhead - do_swap_page.
> Although the kernel can withstand all of the above issues, the situation
> still seems quite awkward and warrants making it more ideal.
> The same race also occurs with small folios, but they have only one PTE,
> thus, it won't be possible for them to be partially unmapped.
> This patch [see below] holds PTL from PTE0, allowing us to avoid reading
> PTE values that are in the process of being transformed. With stable PTE
> values, we can ensure that this large folio is either completely reclaimed
> or that all PTEs remain untouched in this round.
> A corner case is that if we hold PTL from PTE0 and most initial PTEs have
> been really unmapped before that, we may increase the duration of holding
> PTL. Thus we only apply this optimization to folios which are still entirely
> mapped (not in deferred_split list).
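
To make the race described in the commit message concrete, here is a minimal
userspace sketch in C. It is not kernel code: `model_pte`, `lockless_read` and
`model_unmap` are hypothetical names invented for this illustration, and taking
the PTL is only modelled by switching from the transient value a lockless
reader might see to the settled value the concurrent writer installs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pte_state { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/*
 * Hypothetical model of one PTE. `settled` is the value visible once the
 * concurrent writer has finished; `transient_none` marks the short
 * clear-then-rewrite window in which a lockless reader sees pte_none.
 */
struct model_pte {
    enum pte_state settled;
    bool transient_none;
};

/* What a walker sees when it reads the PTE without holding the lock. */
static enum pte_state lockless_read(const struct model_pte *p)
{
    return p->transient_none ? PTE_NONE : p->settled;
}

/*
 * Model of unmapping a large folio mapped by `nr` PTEs.
 * sync == false: skip entries locklessly until the first one that looks
 *                present, and only "take the lock" from there.
 * sync == true:  "take the lock" at PTE0, so only settled values are read.
 * Returns how many PTEs are still present afterwards.
 */
size_t model_unmap(struct model_pte *ptes, size_t nr, bool sync)
{
    size_t start = 0, left = 0;

    assert(ptes != NULL || nr == 0);

    if (!sync) {
        while (start < nr && lockless_read(&ptes[start]) != PTE_PRESENT)
            start++;
    }

    /* From `start` on the lock is held: values are stable, unmap them. */
    for (size_t i = start; i < nr; i++) {
        ptes[i].transient_none = false;
        if (ptes[i].settled == PTE_PRESENT)
            ptes[i].settled = PTE_SWAP;
    }

    /* The writer finishes; count what is still mapped. */
    for (size_t i = 0; i < nr; i++) {
        ptes[i].transient_none = false;
        if (ptes[i].settled == PTE_PRESENT)
            left++;
    }
    return left;
}
```

With four present PTEs where only PTE0 is inside a transient window, the model
leaves PTE0 mapped when `sync` is false (the partially unmapped folio from the
commit message) and leaves nothing mapped when `sync` is true.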
>
> [ Fix ]
>
> * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
> "mm: hold PTL from the first PTE while reclaiming a large folio"
>
> [ Test Plan ]
>
> * An IBM Power 10 system (where PowerVM is mandatory)
> running Ubuntu Server 24.04 (kernel 6.8) or later
> with (nested) KVM setup (so KVM on top of PowerVM).
>
> * Run LTP test suite
> Tests running: SLS(io,base)
>
> * Without the patch the above test will hang with
> Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab
>
> [ Where problems could occur ]
>
> * This is a common code change in the memory management sub-system,
> hence great care needs to be taken, even if it was discussed upfront
> at the https://lore.kernel.org/ mailing list and the upstream commit
> provenance shows that many eyes had a look at this.
>
> * The modification is relatively small with just one if statement
> (across two lines) in mm/vmscan.c.
>
> * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
> from the first page table entry (PTE) and to eliminate the influence of
> temporary and volatile PTE values.
>
> * If done wrong, it can have a negative impact especially on large folios,
> and wrong hints might be given to try_to_unmap,
> which may lead to bad page swapping.
>
> * In case of an issue with this patch the result can also be decreased
> performance and efficiency in the page table handling - the opposite
> of what the patch is supposed to address.
>
> * Fortunately several developers had their eyes on this commit,
> as the provenance of the patch and the discussion at LKML shows.
>
> * Further upstream conversation:
> Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com
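
For readers who want a feel for the shape of the change, the following is a
small userspace sketch of the decision the single if statement makes: ask for
the lock to be held from the first PTE only for large folios that are not on
the deferred split list. Everything here is an assumption for illustration:
`model_folio`, `model_reclaim_flags` and the `MODEL_TTU_*` values are made-up
stand-ins, and I believe (but the cover letter does not state) that the
upstream commit expresses the hint through a try_to_unmap flag named TTU_SYNC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins; these are NOT the kernel's flag values. */
enum model_ttu_flags {
    MODEL_TTU_BATCH_FLUSH = 1u << 0,
    MODEL_TTU_SYNC        = 1u << 1, /* hold the lock from the first PTE */
};

/* Hypothetical, heavily reduced view of a folio. */
struct model_folio {
    unsigned int nr_pages;        /* > 1 means "large folio" in this model */
    bool on_deferred_split_list;  /* true if already partially unmapped */
};

/* Compute the hints that reclaim would pass on to the unmap step. */
unsigned int model_reclaim_flags(const struct model_folio *f)
{
    unsigned int flags = MODEL_TTU_BATCH_FLUSH;

    assert(f != NULL);

    /*
     * Only large folios can end up partially unmapped, and folios on the
     * deferred split list are skipped to avoid holding the lock longer
     * for no benefit.
     */
    if (f->nr_pages > 1 && !f->on_deferred_split_list)
        flags |= MODEL_TTU_SYNC;

    return flags;
}
```

A 16-page folio that is fully mapped gets the extra hint in this model, while a
single-page folio or one on the deferred split list keeps the default flags.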
>
> [ Other Info ]
>
> * The commit has been upstream since v6.10(-rc1), hence it will be included
> in oracular with the planned target kernel of 6.11.
>
> * And since (nested) KVM virtualization on ppc64el was (re-)introduced
> just with noble, no Ubuntu releases older than noble are affected.
>
> Barry Song (1):
> mm: hold PTL from the first PTE while reclaiming a large folio
>
> mm/vmscan.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
Acked-by: Stefan Bader <stefan.bader at canonical.com>