ACK: [SRU][N][PULL] Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0 (LP: 2076866)

Mehmet Basaran mehmet.basaran at canonical.com
Thu Sep 26 15:17:43 UTC 2024


Acked-by: Mehmet Basaran <mehmet.basaran at canonical.com>

-------------- next part --------------
frank.heimes at canonical.com writes:

> BugLink: https://bugs.launchpad.net/bugs/2076866
>
> [ Impact ]
>
>  * A KVM guest (VM) that is live migrated between two Power 10 systems
>    (using nested virtualization, i.e. KVM on top of PowerVM) will
>    very likely crash after about an hour.
>
>  * At that point the live migration itself appears to have completed
>    successfully, but it has not, and the crash is a consequence of it.
>
> [ Test Plan ]
>
>  * Set up two Power 10 systems (with firmware level FW1060 or newer,
>    which supports nested KVM) running Ubuntu Server 24.04 for ppc64el.
>
>  * Set up a qemu/KVM environment that allows live migrating a KVM
>    guest from one P10 system to the other.
>
>  * (The disk type does not seem to matter, hence NFS based disk storage
>     can be used for example).
>
>  * After about an hour the live-migrated guest is likely to crash.
>    Hence wait for 2 hours (which increases the likelihood) until
>    a crash in:
>    "migrate_misplaced_folio+0x540/0x5d0"
>    occurs.
>
> [ Where problems could occur ]
>
>  * The 'fix' avoids calling folio_likely_mapped_shared for cases where
>    the folio might already have been unmapped, and moves these checks
>    under the page table lock (PTL). If done wrong, this may have an
>    impact on page table locking, which may lead to wrong locks, blocked
>    memory and finally crashes. (A simplified sketch of the new ordering
>    follows the links below.)
>
>  * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
>    indirected, which may lead to different behaviour and side effects.
>    However, isolation is still performed, just slightly differently:
>    instead of numamigrate_isolate_folio, it now happens in (the renamed)
>    migrate_misplaced_folio_prepare.
>
>  * Further upstream conversations:
>    https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4b69@redhat.com
>    https://lkml.kernel.org/r/20240626191129.658CFC32782@smtp.kernel.org
>    https://lkml.kernel.org/r/20240620212935.656243-3-david@redhat.com
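>
>    A minimal userspace sketch of the idea only (toy struct, lock and
>    function names are hypothetical, this is not the kernel code): the
>    "is the folio still exclusively mapped?" check and the isolation now
>    both happen while the lock is held, instead of before taking it:
>
>      /* toy model of "move the checks under the PTL"; build with -pthread */
>      #include <pthread.h>
>      #include <stdbool.h>
>      #include <stdio.h>
>
>      struct folio_model {
>              int  mapcount;   /* how many page tables still map the folio */
>              bool isolated;   /* taken off the LRU for migration */
>      };
>
>      /* stand-in for the page table lock (PTL) */
>      static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
>
>      /* old flow (racy): the mapping state is sampled before the lock is
>       * taken, so a concurrent unmap can invalidate the decision */
>      static bool prepare_racy(struct folio_model *f)
>      {
>              bool exclusive = (f->mapcount == 1);   /* checked outside the lock */
>              pthread_mutex_lock(&ptl);
>              bool ok = exclusive && !f->isolated;
>              if (ok)
>                      f->isolated = true;
>              pthread_mutex_unlock(&ptl);
>              return ok;
>      }
>
>      /* new flow: check and isolation both happen under the lock, roughly
>       * what calling migrate_misplaced_folio_prepare() under the PTL
>       * achieves in the real patch */
>      static bool prepare_locked(struct folio_model *f)
>      {
>              pthread_mutex_lock(&ptl);
>              bool ok = (f->mapcount == 1) && !f->isolated;
>              if (ok)
>                      f->isolated = true;
>              pthread_mutex_unlock(&ptl);
>              return ok;
>      }
>
>      int main(void)
>      {
>              struct folio_model f = { .mapcount = 1, .isolated = false };
>              printf("racy:   %d\n", prepare_racy(&f));
>              f.isolated = false;
>              printf("locked: %d\n", prepare_locked(&f));
>              return 0;
>      }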
>
>  * Changing the confusing return code so that migrate_misplaced_folio
>    now simply returns 0 on success clarifies the return code handling
>    and usage, and was mainly done in preparation for further changes,
>    but it can have bad side effects if other code still uses the return
>    value with its old meaning (see the small before/after sketch below).
>
>  * Further upstream conversations:
>    https://lkml.kernel.org/r/20240620212935.656243-1-david@redhat.com
>    https://lkml.kernel.org/r/20240620212935.656243-2-david@redhat.com
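>
>    A rough before/after illustration (hypothetical stand-in functions,
>    not the real kernel signatures):
>
>      /* toy model of the return convention change */
>      #include <errno.h>
>      #include <stdio.h>
>
>      /* old convention (roughly): non-zero when the folio was migrated */
>      static int migrate_old(int success) { return success ? 1 : 0; }
>
>      /* new convention: 0 on success, negative errno on failure */
>      static int migrate_new(int success) { return success ? 0 : -EAGAIN; }
>
>      int main(void)
>      {
>              if (migrate_old(1))
>                      printf("old API: migrated\n");
>              /* a caller still written against the old convention would
>               * misread the new return value - exactly the side effect
>               * described above */
>              if (!migrate_new(1))
>                      printf("new API: migrated\n");
>              return 0;
>      }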
>
>  * The fact that NUMA balancing prohibits mTHP
>    (multi-size Transparent Hugepage Support) seems unreasonable since
>    these are exclusive mappings, so this restriction is lifted.
>    Allowing it brings significant performance improvements
>    (see the commit message of d2136d749d76), but it introduces
>    significant changes to PTE mapping and modification and relies on
>    further commits:
>    859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
>    80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
>    This could cause issues on systems configured for THP and may confuse
>    the ordering, which may even lead to memory corruption.
>    It may especially hit (NUMA) systems with high core counts,
>    where balancing is needed more often. (A toy sketch of the batched
>    PTE handling follows the links below.)
>
>  * Further upstream conversations:
>    https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
>    https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
>    https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
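>
>    As a toy sketch only (hypothetical names, no real kernel APIs): with
>    mTHP, a NUMA hinting fault on one PTE now has to treat the whole
>    large folio, i.e. a batch of contiguous PTEs, rather than a single
>    base page - which is where the ordering concerns above come from:
>
>      /* toy: restore a NUMA-protected mTHP as one batch of PTEs */
>      #include <stdbool.h>
>      #include <stdio.h>
>
>      #define FOLIO_PAGES 16   /* e.g. a 64K mTHP built from 4K base pages */
>
>      struct pte_model { bool prot_none; };
>
>      /* clear the NUMA hinting protection for every PTE that maps the
>       * folio, not only for the one that faulted */
>      static void numa_rebuild_batch(struct pte_model *ptes, int nr)
>      {
>              for (int i = 0; i < nr; i++)
>                      ptes[i].prot_none = false;
>      }
>
>      int main(void)
>      {
>              struct pte_model ptes[FOLIO_PAGES];
>              for (int i = 0; i < FOLIO_PAGES; i++)
>                      ptes[i].prot_none = true;    /* marked by NUMA balancing */
>              numa_rebuild_batch(ptes, FOLIO_PAGES);
>              printf("first pte prot_none=%d\n", ptes[0].prot_none);
>              return 0;
>      }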
>
>  * The refactoring that moves the code for rebuilding the NUMA mapping
>    into a new helper seems to be straightforward, since the active code
>    stays unchanged; the new function only needs to be callable, which is
>    the case since it all lives in mm/memory.c (see the sketch below).
>
>  * Further upstream conversations:
>    https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com
>    https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com
>    https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com
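>
>    A hedged sketch of the shape of that refactoring (simplified,
>    hypothetical toy types; only the extract-function structure is the
>    point):
>
>      #include <stdbool.h>
>
>      struct pte_model { bool prot_none; bool writable; };
>
>      /* new static helper: the same statements as before, just factored
>       * out of the fault handler */
>      static void numa_rebuild_single_mapping(struct pte_model *pte, bool writable)
>      {
>              pte->prot_none = false;    /* drop the NUMA hinting protection */
>              pte->writable  = writable; /* restore the original permissions */
>      }
>
>      /* the fault handler now calls the helper instead of open-coding it */
>      static void do_numa_fault(struct pte_model *pte, bool was_writable)
>      {
>              numa_rebuild_single_mapping(pte, was_writable);
>      }
>
>      int main(void)
>      {
>              struct pte_model pte = { .prot_none = true, .writable = false };
>              do_numa_fault(&pte, true);
>              return pte.prot_none;      /* 0 = protection dropped as expected */
>      }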
>
>  * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
>    is more significant, since the logic changed from
>    (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
>    (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
>    tables of more than one MM'.
>
>  * Since this is an estimation, the results may be unpredictable
>    (especially for bigger folios) and may differ from what is expected
>    or assumed (there are quite a few side notes in the code comments of
>    bb34f78d72c2 that mention potentially fuzzy results), hence this
>    may lead to unforeseen behavior.
>
>  * The conditional statements become clearer since they are now based on
>    (more or less obvious) mapping counts, but they can still be erroneous
>    in case the underlying estimation is incorrect (see the toy model
>    after the links below).
>
>  * Further upstream conversations:
>    https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com
>    https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com
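>
>    A toy model of the semantic shift (simplified heuristics and
>    hypothetical helper implementations, not the real mm.h code):
>
>      #include <stdbool.h>
>      #include <stdio.h>
>
>      struct folio_model {
>              int nr_pages;        /* base pages in the folio */
>              int total_mapcount;  /* sum of the per-page mapcounts */
>              int first_mapcount;  /* mapcount of the first base page only */
>      };
>
>      /* old semantics: an estimated *count* of sharers, derived from the
>       * first subpage only */
>      static int estimated_sharers(const struct folio_model *f)
>      {
>              return f->first_mapcount;
>      }
>
>      /* new semantics: a yes/no guess whether more than one MM maps the
>       * folio; for large folios this stays fuzzy, which is the
>       * unpredictability noted above */
>      static bool likely_mapped_shared(const struct folio_model *f)
>      {
>              if (f->total_mapcount <= 1)
>                      return false;
>              /* more mappings than base pages means some page must be
>               * mapped at least twice, so a second user is likely */
>              return f->total_mapcount > f->nr_pages;
>      }
>
>      int main(void)
>      {
>              struct folio_model f = { 16, 16, 1 };
>              printf("sharers=%d shared=%d\n",
>                     estimated_sharers(&f), likely_mapped_shared(&f));
>              return 0;
>      }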
>
>  * Commit 133d04b1eee9 extends commit bda420b98505 "numa balancing: migrate
>    on fault among multiple bound nodes" from allowing NUMA fault migrations
>    when the executing node is part of the policy mask for MPOL_BIND,
>    to also support MPOL_PREFERRED_MANY policy.
>    Both cases (MPOL_BIND and MPOL_PREFERRED_MANY) are treated in the
>    same way. If the NUMA topology is not considered correctly, changes
>    here may lead to decreased memory performance.
>    However, the code changes themselves are relatively easy to follow
>    (a small nodemask sketch follows the links below).
>
>  * Further upstream conversations:
>    https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
>    https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
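>
>    Conceptually (a toy sketch with hypothetical types and helpers; the
>    real check lives in the mempolicy fault path): migration on a NUMA
>    hinting fault is only allowed when the node we are executing on is
>    part of the policy's node mask, now for MPOL_PREFERRED_MANY as well
>    as MPOL_BIND:
>
>      #include <stdbool.h>
>      #include <stdio.h>
>
>      enum mode { MODE_BIND, MODE_PREFERRED_MANY, MODE_OTHER };
>
>      struct policy_model { enum mode mode; unsigned long nodemask; };
>
>      static bool node_in_mask(unsigned long mask, int nid)
>      {
>              return mask & (1UL << nid);
>      }
>
>      /* allow migrate-on-fault only if the executing node is one of the
>       * nodes the policy binds to or prefers */
>      static bool migrate_on_fault_allowed(const struct policy_model *pol, int exec_nid)
>      {
>              if (pol->mode != MODE_BIND && pol->mode != MODE_PREFERRED_MANY)
>                      return false;
>              return node_in_mask(pol->nodemask, exec_nid);
>      }
>
>      int main(void)
>      {
>              struct policy_model pol = { MODE_PREFERRED_MANY, 0x6 }; /* nodes 1,2 */
>              printf("exec on node 1: %d\n", migrate_on_fault_allowed(&pol, 1));
>              printf("exec on node 3: %d\n", migrate_on_fault_allowed(&pol, 3));
>              return 0;
>      }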
>
>  * Finally, commit f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead
>    of cpu_to_node()") is part of a patch set that further optimizes
>    cross-socket memory access with the MPOL_PREFERRED_MANY policy.
>    The mpol_misplaced changes mainly move from cpu_to_node to
>    numa_node_id, and with that make the code more NUMA aware.
>    Based on that, the vm_fault (vmf) needs to be considered instead of
>    the vm_area_struct (vma).
>    This may have consequences for the memory policy itself (a tiny
>    sketch of the node lookup difference follows the links below).
>
>  * Further upstream conversations:
>    https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
>    https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
>    https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
>    https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
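>
>    The node lookup difference in a nutshell (a toy sketch with
>    hypothetical tables and variables, not the kernel implementation):
>
>      #include <stdio.h>
>
>      /* table-based lookup, cpu_to_node(cpu) style */
>      static const int cpu_to_node_map[4] = { 0, 0, 1, 1 };
>      static int current_cpu = 2;
>      static int cpu_to_node_model(int cpu) { return cpu_to_node_map[cpu]; }
>
>      /* numa_node_id() style: the current node is kept in a cached
>       * per-CPU value, so no CPU id lookup is needed on the fast path */
>      static int cached_numa_node = 1;
>      static int numa_node_id_model(void) { return cached_numa_node; }
>
>      int main(void)
>      {
>              printf("via cpu_to_node:  %d\n", cpu_to_node_model(current_cpu));
>              printf("via numa_node_id: %d\n", numa_node_id_model());
>              return 0;
>      }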
>
>  * The overall patch set touches quite a bit of common code,
>    but the modifications were intensely discussed with many experts
>    in the various mailing-list threads that are referenced above.
>
> [ Other Info ]
>
>  * The first two "mm/migrate" commits are the newest and were
>    accepted upstream with kernel v6.11(-rc1);
>    all others have been upstream since v6.10(-rc1).
>
>  * Hence oracular (with a planned target kernel of 6.11) is not affected,
>    and the SRU is for noble only.
>
>  * And since (nested) KVM virtualization on ppc64el was (re-)introduced
>    only with noble, no Ubuntu releases older than noble are affected.
>
> The following changes since commit 2f325e7ecae4d38c3f0a73d0dc06441ac9c27fd9:
>
>   drm/amdgpu/pptable: Fix UBSAN array-index-out-of-bounds (2024-08-30 17:11:22 +0200)
>
> are available in the Git repository at:
>
>   https://git.launchpad.net/~fheimes/+git/lp2076866/ 5d9eb99803ea783f1ef649aef6aca91d600bea6c
>
> for you to fetch changes up to 5d9eb99803ea783f1ef649aef6aca91d600bea6c:
>
>   mm/migrate: move NUMA hinting fault folio isolation + checks under PTL (2024-09-05 10:34:38 +0200)
>
> ----------------------------------------------------------------
> Baolin Wang (2):
>       mm: factor out the numa mapping rebuilding into a new helper
>       mm: support multi-size THP numa balancing
>
> David Hildenbrand (3):
>       mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
>       mm/migrate: make migrate_misplaced_folio() return 0 on success
>       mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
>
> Donet Tom (2):
>       mm/mempolicy: use numa_node_id() instead of cpu_to_node()
>       mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
>
>  include/linux/mempolicy.h |  5 ++-
>  include/linux/migrate.h   |  7 ++++
>  include/linux/mm.h        | 48 +++++++++++++++++++-----
>  mm/huge_memory.c          | 17 +++++----
>  mm/internal.h             |  2 +-
>  mm/madvise.c              |  6 +--
>  mm/memory.c               | 95 +++++++++++++++++++++++++++++++++++------------
>  mm/mempolicy.c            | 50 ++++++++++++++++---------
>  mm/migrate.c              | 83 +++++++++++++++++++----------------------
>  mm/mprotect.c             |  3 +-
>  10 files changed, 207 insertions(+), 109 deletions(-)