ACK/Cmnt: [SRU][N][PULL] Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0 (LP: 2076866)
Stefan Bader
stefan.bader at canonical.com
Wed Sep 25 14:18:10 UTC 2024
On 05.09.24 15:54, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2076866
>
> [ Impact ]
>
> * A KVM guest (VM) that got live migrated between two Power 10 systems
> (using nested virtualization, i.e. KVM on top of PowerVM) will
> very likely crash after about an hour.
>
> * At that point the live migration itself appears to have completed
> successfully, but it has not, and the crash is a consequence of that.
>
> [ Test Plan ]
>
> * Set up two Power 10 systems (with firmware level FW1060 or newer,
> which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
>
> * Set up a QEMU/KVM environment that allows live migration of a KVM
> guest from one P10 system to the other.
>
> * (The disk type does not seem to matter, hence NFS-based disk storage
> can be used, for example.)
>
> * After about an hour the live migrated guest is likely to crash.
> Hence wait for 2 hours (which increases the likelihood) until
> a crash due to:
> "migrate_misplaced_folio+0x540/0x5d0"
> occurs.
>
> [ Where problems could occur ]
>
> * The 'fix' avoids calling folio_likely_mapped_shared for cases where
> the folio might have already been unmapped, and it moves the checks.
> If done wrong, this might have an impact on page table locks,
> which may lead to wrong locking, blocked memory and finally crashes.
>
> * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
> indirect, which may lead to different behaviour and side effects.
> However, isolation is still done, just slightly differently:
> instead of using numamigrate_isolate_folio, it is now done in (the renamed)
> migrate_misplaced_folio_prepare.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4b69@redhat.com
> https://lkml.kernel.org/r/20240626191129.658CFC32782@smtp.kernel.org
> https://lkml.kernel.org/r/20240620212935.656243-3-david@redhat.com
>
> * Fixing a confusing return code, so that it now just returns 0 on success,
> clarifies the return code handling and usage, and was mainly done in
> preparation of further changes,
> but can have bad side effects if the return code was already used as-is
> in other code places.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/20240620212935.656243-1-david@redhat.com
> https://lkml.kernel.org/r/20240620212935.656243-2-david@redhat.com
>
> * NUMA balancing prohibits mTHP
> (multi-size Transparent Hugepage Support), which seems to be unreasonable
> for an exclusive mapping.
> Allowing this seems to bring significant performance improvements
> (see commit message d2136d749d76), but introduced significant changes
> to PTE mapping and modifications and even relies on further commits:
> 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
> 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
> This may cause issues on systems configured for THP and
> may confuse the ordering, which may even lead to memory corruption.
> And this may especially hit (NUMA) systems with high core counts,
> where balancing is needed more often.
>
> * Further upstream conversations:
> https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
> https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
>
> * The refactoring of the code for NUMA mapping rebuilding and moving
> it into a new helper seems to be straightforward, since the active code
> stays unchanged. The new function needs to be callable, but this
> is the case since it's all in mm/memory.c.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com
>
> * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
> is more significant, since the logic changed from
> (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
> (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
> tables of more than one MM'.
>
> * Since this is an estimation, the results may be unpredictable
> (especially for bigger folios), and not as expected or assumed
> (there are quite a few side notes in the code comments of bb34f78d72c2
> that mention potentially fuzzy results), hence this
> may lead to unforeseen behavior.
>
> * The condition statements became clearer since they are now based on
> (more or less obvious) number counts, but can still be erroneous in
> case folio_estimated_sharers does incorrect calculations.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com
> https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com
>
> * Commit 133d04b1eee9 extends commit bda420b98505 "numa balancing: migrate
> on fault among multiple bound nodes" from allowing NUMA fault migrations
> when the executing node is part of the policy mask for MPOL_BIND,
> to also support MPOL_PREFERRED_MANY policy.
> Both cases (MPOL_BIND and MPOL_PREFERRED_MANY) are treated in the same way.
> In case the NUMA topology is not correctly considered, changes here
> may lead to decreased memory performance.
> However, the code changes themselves are relatively traceable.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
>
> * Finally commit f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead
> of cpu_to_node()") is a patchset to further optimize the cross-socket
> memory access with MPOL_PREFERRED_MANY policy.
> The mpol_misplaced changes are mainly moving from cpu_to_node to
> numa_node_id, and with that make the code more NUMA aware.
> Based on that vm_fault/vmf needs to be considered instead of
> vm_area_struct/vma.
> This may have consequences on the memory policy itself.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
>
> * The overall patch set touches quite a bit of common code,
> but the modifications were intensely discussed with many experts
> in the various mailing-list threads that are referenced above.
>
> [ Other Info ]
>
> * The first two "mm/migrate" commits are the newest and were
> accepted upstream with kernel v6.11(-rc1),
> all others have been upstream since v6.10(-rc1).
>
> * Hence oracular (with a planned target kernel of 6.11) is not affected,
> and the SRU is for noble only.
>
> * And since (nested) KVM virtualization on ppc64el was (re-)introduced
> just with noble, no Ubuntu releases older than noble are affected.
>
> The following changes since commit 2f325e7ecae4d38c3f0a73d0dc06441ac9c27fd9:
>
> drm/amdgpu/pptable: Fix UBSAN array-index-out-of-bounds (2024-08-30 17:11:22 +0200)
>
> are available in the Git repository at:
>
> https://git.launchpad.net/~fheimes/+git/lp2076866/ 5d9eb99803ea783f1ef649aef6aca91d600bea6c
>
> for you to fetch changes up to 5d9eb99803ea783f1ef649aef6aca91d600bea6c:
>
> mm/migrate: move NUMA hinting fault folio isolation + checks under PTL (2024-09-05 10:34:38 +0200)
>
> ----------------------------------------------------------------
> Baolin Wang (2):
> mm: factor out the numa mapping rebuilding into a new helper
> mm: support multi-size THP numa balancing
>
> David Hildenbrand (3):
> mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
> mm/migrate: make migrate_misplaced_folio() return 0 on success
> mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
>
> Donet Tom (2):
> mm/mempolicy: use numa_node_id() instead of cpu_to_node()
> mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
>
> include/linux/mempolicy.h | 5 ++-
> include/linux/migrate.h | 7 ++++
> include/linux/mm.h | 48 +++++++++++++++++++-----
> mm/huge_memory.c | 17 +++++----
> mm/internal.h | 2 +-
> mm/madvise.c | 6 +--
> mm/memory.c | 95 +++++++++++++++++++++++++++++++++++------------
> mm/mempolicy.c | 50 ++++++++++++++++---------
> mm/migrate.c | 83 +++++++++++++++++++----------------------
> mm/mprotect.c | 3 +-
> 10 files changed, 207 insertions(+), 109 deletions(-)
>
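The quoted risk list describes the move from 'estimate the number of sharers of a folio' to 'estimate if the folio is mapped into the page tables of more than one MM', and notes that the estimate can be fuzzy for bigger folios. A toy userspace model of that difference is sketched here. It is not the kernel implementation: struct toy_folio, toy_estimated_sharers and toy_likely_mapped_shared are hypothetical names, and the heuristic (look at the total mapping count, then fall back to the first page) is an assumption chosen only to show how a yes/no estimate can miss a shared large folio.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a folio; the fields are illustrative only. */
struct toy_folio {
	int nr_pages;            /* number of pages in the folio */
	int first_page_mapcount; /* mappings of the first page only */
	int total_mapcount;      /* mappings summed over all pages */
};

/* Older style (assumed): report a count, callers compare it with 1. */
int toy_estimated_sharers(const struct toy_folio *f)
{
	return f->first_page_mapcount;
}

/* Newer style (assumed): answer "probably shared?" directly. */
bool toy_likely_mapped_shared(const struct toy_folio *f)
{
	if (f->total_mapcount <= 1)
		return false; /* at most one mapping: exclusive */
	if (f->nr_pages == 1)
		return true;  /* single page mapped more than once */
	if (f->total_mapcount > f->nr_pages)
		return true;  /* some page must be mapped more than once */
	return f->first_page_mapcount > 1; /* otherwise guess from page 0 */
}

/* Small self-check of the model. */
void toy_demo_check(void)
{
	/* 4-page folio, every page mapped twice: reported as shared. */
	struct toy_folio both = { 4, 2, 8 };
	/* 4-page folio, one MM maps pages 0-1, another maps pages 2-3:
	 * the model reports "not shared", which is the fuzzy case. */
	struct toy_folio split = { 4, 1, 4 };

	assert(toy_likely_mapped_shared(&both));
	assert(!toy_likely_mapped_shared(&split));
	assert(toy_estimated_sharers(&split) == 1);
}
```

In this model the split case yields the same numbers as a folio that is fully mapped by a single MM, so no estimate based on these counters alone can tell the two apart.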
This is a relatively big and, judging from the length of the explanations,
complex change to generic code. How much this affects other systems is hard to
predict. AMD-based PCs tended to be NUMA, too. So we need all the help
we can get in ensuring this does not cause breakage.
Acked-by: Stefan Bader <stefan.bader at canonical.com>
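
On the return-code concern in the quoted risk list (migrate_misplaced_folio() now returning 0 on success): a minimal userspace sketch of why a caller written for a different convention would misread the result. This is not kernel code. The names old_style_migrate, new_style_migrate and the two caller helpers are hypothetical, and the assumption that the earlier convention signalled success with a non-zero value is made here for illustration and is not taken from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical older convention (assumed): non-zero means "migrated". */
int old_style_migrate(bool can_migrate)
{
	return can_migrate ? 1 : 0;
}

/* Hypothetical newer convention: 0 on success, negative errno on failure. */
int new_style_migrate(bool can_migrate)
{
	return can_migrate ? 0 : -EAGAIN;
}

/* Caller logic written against the assumed older convention. */
bool caller_expects_nonzero(int ret)
{
	return ret != 0;
}

/* Caller logic written against the newer convention. */
bool caller_expects_zero(int ret)
{
	return ret == 0;
}

/* Small self-check: matching pairs agree, a stale caller gets it backwards. */
void check_conventions(void)
{
	assert(caller_expects_nonzero(old_style_migrate(true)));
	assert(caller_expects_zero(new_style_migrate(true)));
	/* Stale caller paired with the newer convention: success looks like failure. */
	assert(!caller_expects_nonzero(new_style_migrate(true)));
	/* ...and failure looks like success. */
	assert(caller_expects_nonzero(new_style_migrate(false)));
}
```

The point of the sketch is only that every caller has to be converted together with the function, which is what the review of such a change needs to confirm.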