APPLIED: [SRU][N][PULL] Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0 (LP: 2076866)
Stefan Bader
stefan.bader at canonical.com
Thu Sep 26 16:00:44 UTC 2024
On 05.09.24 15:54, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2076866
>
> [ Impact ]
>
> * A KVM guest (VM) that got live migrated between two Power 10 systems
>    (using nested virtualization, i.e. KVM on top of PowerVM) will
>    very likely crash after about an hour.
>
> * At that point the live migration itself looked as if it had been
>    successful, but it wasn't, and the crash is caused by that.
>
> [ Test Plan ]
>
> * Set up two Power 10 systems (with firmware level FW1060 or newer,
>    which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
>
> * Set up a qemu/KVM environment that allows live migrating a KVM
>    guest from one P10 system to the other.
>
> * (The disk type does not seem to matter; NFS-based disk storage
>    can be used, for example.)
>
> * After about an hour the live-migrated guest is likely to crash.
>    Hence wait for 2 hours (which increases the likelihood) and
> a crash due to:
> "migrate_misplaced_folio+0x540/0x5d0"
> occurs.
>
> [ Where problems could occur ]
>
> * The 'fix' avoids calling folio_likely_mapped_shared() for cases where
>    the folio might already have been unmapped, and moving these checks
>    could, if done wrong, have an impact on page table locking,
>    which may lead to wrong locks, blocked memory and, finally, crashes.
>
> * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
>    indirected, which may lead to different behaviour and side effects.
>    However, isolation is still done, just slightly differently:
>    instead of using numamigrate_isolate_folio(), it now happens in
>    (the renamed) migrate_misplaced_folio_prepare().
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4b69@redhat.com
> https://lkml.kernel.org/r/20240626191129.658CFC32782@smtp.kernel.org
> https://lkml.kernel.org/r/20240620212935.656243-3-david@redhat.com
>
> * Changing a confusing return code to simply return 0 on success
>    clarifies the return code handling and usage, and was mainly done in
>    preparation for further changes,
>    but can have bad side effects if the old return code is still used
>    as-is in other places.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/20240620212935.656243-1-david@redhat.com
> https://lkml.kernel.org/r/20240620212935.656243-2-david@redhat.com
>
> * That NUMA balancing prohibits mTHP
>    (multi-size Transparent Hugepage Support) seems unreasonable,
>    since such a folio can be an exclusive mapping.
>    Allowing it brings significant performance improvements
>    (see the commit message of d2136d749d76), but introduced significant
>    changes to PTE mapping and modification, and even relies on further commits:
>    859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
>    80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
>    This can cause issues on systems configured for THP and
>    may confuse the ordering, which may even lead to memory corruption.
>    And this may especially hit (NUMA) systems with high core counts,
>    where balancing is needed more often.
>
> * Further upstream conversations:
> https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
> https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
>
> * The refactoring of the code for NUMA mapping rebuilding, moving
>    it into a new helper, seems straightforward, since the active code
>    stays unchanged; the new function needs to be callable, but this
>    is the case since it is all in mm/memory.c.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com
>
> * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
> is more significant, since the logic changed from
> (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
> (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
> tables of more than one MM'.
>
> * Since this is an estimation, the results may be unpredictable
>    (especially for bigger folios), and may not match what is expected
>    or assumed (there are quite a few side notes in the code comments of
>    bb34f78d72c2 that mention potentially fuzzy results), hence this
>    may lead to unforeseen behaviour.
>
> * The condition statements became clearer since they are now based on
>    (more or less obvious) number counts, but can still be erroneous in
>    case the sharers estimation is incorrect.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com
> https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com
>
> * Commit 133d04b1eee9 extends commit bda420b98505 "numa balancing: migrate
> on fault among multiple bound nodes" from allowing NUMA fault migrations
> when the executing node is part of the policy mask for MPOL_BIND,
> to also support MPOL_PREFERRED_MANY policy.
> Both cases (MPOL_BIND and MPOL_PREFERRED_MANY) are treated in the same way.
>    In case the NUMA topology is not correctly considered, changes here
>    may lead to decreased memory performance.
>    However, the code changes themselves are relatively easy to trace.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
>
> * Finally, commit f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead
>    of cpu_to_node()") is part of a patch set to further optimize
>    cross-socket memory access with the MPOL_PREFERRED_MANY policy.
>    The mpol_misplaced() changes mainly replace cpu_to_node() with
>    numa_node_id(), and with that make the code more NUMA aware.
>    Based on that, a vm_fault/vmf needs to be passed in instead of a
>    vm_area_struct/vma.
>    This may have consequences for the memory policy itself.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
>
> * The overall patch set touches quite a bit of common code,
> but the modifications were intensely discussed with many experts
> in the various mailing-list threads that are referenced above.
>
> [ Other Info ]
>
> * The first two "mm/migrate" commits are the newest and were
>    accepted upstream with kernel v6.11(-rc1);
>    all others have been upstream since v6.10(-rc1).
>
> * Hence oracular (with a planned target kernel of 6.11) is not affected,
>    and the SRU is for noble only.
>
> * And since (nested) KVM virtualization on ppc64el was (re-)introduced
>    just with noble, no Ubuntu release older than noble is affected.
>
> The following changes since commit 2f325e7ecae4d38c3f0a73d0dc06441ac9c27fd9:
>
> drm/amdgpu/pptable: Fix UBSAN array-index-out-of-bounds (2024-08-30 17:11:22 +0200)
>
> are available in the Git repository at:
>
> https://git.launchpad.net/~fheimes/+git/lp2076866/ 5d9eb99803ea783f1ef649aef6aca91d600bea6c
>
> for you to fetch changes up to 5d9eb99803ea783f1ef649aef6aca91d600bea6c:
>
> mm/migrate: move NUMA hinting fault folio isolation + checks under PTL (2024-09-05 10:34:38 +0200)
>
> ----------------------------------------------------------------
> Baolin Wang (2):
> mm: factor out the numa mapping rebuilding into a new helper
> mm: support multi-size THP numa balancing
>
> David Hildenbrand (3):
> mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
> mm/migrate: make migrate_misplaced_folio() return 0 on success
> mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
>
> Donet Tom (2):
> mm/mempolicy: use numa_node_id() instead of cpu_to_node()
> mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
>
> include/linux/mempolicy.h | 5 ++-
> include/linux/migrate.h | 7 ++++
> include/linux/mm.h | 48 +++++++++++++++++++-----
> mm/huge_memory.c | 17 +++++----
> mm/internal.h | 2 +-
> mm/madvise.c | 6 +--
> mm/memory.c | 95 +++++++++++++++++++++++++++++++++++------------
> mm/mempolicy.c | 50 ++++++++++++++++---------
> mm/migrate.c | 83 +++++++++++++++++++----------------------
> mm/mprotect.c | 3 +-
> 10 files changed, 207 insertions(+), 109 deletions(-)
>
Applied to noble:linux/master-next. Thanks.
-Stefan