ACK: [SRU][N][PATCH 0/1] Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test (LP: 2076147)

Stefan Bader stefan.bader at canonical.com
Thu Sep 26 15:29:04 UTC 2024


On 05.09.24 09:48, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2076147
> 
> SRU Justification:
> 
>   * A KVM second-level (L2) guest, that is, a KVM VM that runs nested on top
>     of a Power 10 PowerVM hypervisor, hangs while running the LTP (Linux Test
>     Project) test suite.
> 
>   * It hangs with:
>     "Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab"
> 
>   * Diagnosing the issue points to this fix/upstream commit:
>     [commit message, by Barry Song <v-songbaohua at oppo.com>]
>     Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
>     modifications preceded by pte clear. While iterating over PTEs of a large folio,
>     it only starts acquiring PTL from the first valid (present) PTE.
>     PTE modifications can temporarily set PTEs to pte_none.
>     Consequently, the initial PTEs of a large folio might be skipped
>     in try_to_unmap_one().
>     For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
>     still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
>     try_to_unmap_one().
>     So folio will be still mapped, the folio fails to be reclaimed and is put
>     back to LRU in this round.
>     This also breaks up PTE optimizations such as CONT-PTE on this large folio
>     and may lead to an accidental folio_split() afterwards.
>     And since a part of PTEs are now swap entries, accessing those parts will
>     introduce overhead - do_swap_page.
>     Although the kernel can withstand all of the above issues, the situation
>     still seems quite awkward and warrants making it more ideal.
>     The same race also occurs with small folios, but they have only one PTE,
>     thus, it won't be possible for them to be partially unmapped.
>     This patch [see below] holds PTL from PTE0, allowing us to avoid reading
>     PTE values that are in the process of being transformed. With stable PTE
>     values, we can ensure that this large folio is either completely reclaimed
>     or that all PTEs remain untouched in this round.
>     A corner case is that if we hold PTL from PTE0 and most initial PTEs have
>     been really unmapped before that, we may increase the duration of holding
>     PTL. Thus we only apply this optimization to folios which are still entirely
>     mapped (not in deferred_split list).
> 
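To illustrate the race described in the quoted commit message, here is a minimal user-space sketch in C. It is a toy model, not kernel code: `toy_pte`, `toy_unmap_folio()` and the `pte0_in_flight` parameter are hypothetical names invented for this illustration. A walk that does not hold the lock from PTE0 may observe PTE0 as empty while another thread is rewriting it, and skip it; a walk that holds the lock from PTE0 only sees stable values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE states; the real kernel encodes these in the PTE bits. */
enum toy_pte { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_SWAP };

/*
 * Toy model of unmapping one large folio that is mapped by nr PTEs.
 *
 * pte0_in_flight:       a concurrent writer has cleared PTE0 for a moment
 *                       and will restore its present value afterwards.
 * hold_lock_from_first: models holding the PTL from PTE0, so the walker
 *                       never observes the transient (cleared) value.
 *
 * Returns the number of PTEs that are still present afterwards.
 */
static size_t toy_unmap_folio(enum toy_pte *ptes, size_t nr,
                              bool pte0_in_flight,
                              bool hold_lock_from_first)
{
    size_t still_present = 0;

    assert(ptes != NULL && nr > 0);

    for (size_t i = 0; i < nr; i++) {
        enum toy_pte seen = ptes[i];

        /* Unlocked walk: the in-flight PTE0 looks empty and is skipped. */
        if (i == 0 && pte0_in_flight && !hold_lock_from_first)
            seen = TOY_PTE_NONE;

        if (seen == TOY_PTE_PRESENT)
            ptes[i] = TOY_PTE_SWAP; /* unmapped, now a swap entry */
    }

    /* Count what the writer's restored values leave behind. */
    for (size_t i = 0; i < nr; i++)
        if (ptes[i] == TOY_PTE_PRESENT)
            still_present++;

    return still_present;
}
```

With four present PTEs and PTE0 in flight, the unlocked walk leaves PTE0 present and turns the other three into swap entries, which is the partially unmapped state the commit message describes. The locked walk converts all four.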
> [ Fix ]
> 
>   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
>     "mm: hold PTL from the first PTE while reclaiming a large folio"
> 
> [ Test Plan ]
> 
>   * An IBM Power 10 system (where PowerVM is mandatory)
>     running Ubuntu Server 24.04 (kernel 6.8) or later
>     with (nested) KVM setup (so KVM on top of PowerVM).
> 
>   * Run LTP test suite
>     Tests running: SLS(io,base)
> 
>   * Without the patch the above test will hang with
>     Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab
> 
> [ Where problems could occur ]
> 
>   * This is a common code change in the memory management sub-system,
>     hence great care needs to be taken, even if it was discussed upfront
>     at the https://lore.kernel.org/ mailing list and the upstream commit
>     provenance shows that many eyes had a look at this.
> 
>   * The modification is relatively small with just one if statement
>     (across two lines) in mm/vmscan.c.
> 
>   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
>     from the first page table entry (PTE) and to eliminate the influence of
>     temporary and volatile PTE values.
> 
>   * If done wrong, it can have a negative impact, especially in the case of
>     large folios: wrong hints might be given to try_to_unmap,
>     which may lead to bad page swapping.
> 
>   * In case of an issue with this patch the result can also be decreased
>     performance and efficiency in the page table handling - the opposite
>     of what the patch is supposed to address.
> 
>   * Fortunately several developers had their eyes on this commit,
>     as the provenance of the patch and the discussion at LKML shows.
> 
>   * Further upstream conversation:
>     Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com
> 
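The shape of the change described in the bullets above can be sketched as a stand-alone C function. This is an illustration under assumptions, not the patch itself: `struct toy_folio`, `TOY_TTU_SYNC` and `toy_reclaim_flags()` are hypothetical stand-ins, and the flag value is arbitrary. The condition follows the commit message: ask for the walk that holds PTL from the first PTE only for large folios that are still entirely mapped (not on the deferred split list).

```c
#include <stdbool.h>

/* Hypothetical flag bit standing in for the "hold PTL from PTE0" hint. */
#define TOY_TTU_SYNC 0x10u

/* Hypothetical, heavily reduced view of a folio. */
struct toy_folio {
    unsigned int nr_pages;       /* more than 1 means a large folio */
    bool on_deferred_split_list; /* true if already partially unmapped */
};

/*
 * Return the unmap flags to use for this folio during reclaim.
 * Small folios have a single PTE and cannot be partially unmapped;
 * folios on the deferred split list are skipped so that the PTL is
 * not held longer than needed.
 */
static unsigned int toy_reclaim_flags(const struct toy_folio *folio,
                                      unsigned int flags)
{
    bool is_large = folio->nr_pages > 1;

    if (is_large && !folio->on_deferred_split_list)
        flags |= TOY_TTU_SYNC;

    return flags;
}
```

Keeping the check to a single condition mirrors the bullet above about the modification being small; everything else in the reclaim path is left unchanged in this sketch.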
> [ Other Info ]
> 
>   * The commit is upstream since v6.10(-rc1), hence it will be included
>     in oracular with the planned target kernel of 6.11.
> 
>   * And since (nested) KVM virtualization on ppc64el was (re-)introduced
>     only with noble, no Ubuntu releases older than noble are affected.
> 
> Barry Song (1):
>    mm: hold PTL from the first PTE while reclaiming a large folio
> 
>   mm/vmscan.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 

Acked-by: Stefan Bader <stefan.bader at canonical.com>