APPLIED: [SRU][N][PULL] Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0 (LP: 2076866)
Stefan Bader
stefan.bader at canonical.com
Thu Sep 26 16:00:44 UTC 2024
On 05.09.24 15:54, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2076866
>
> [ Impact ]
>
> * A KVM guest (VM) that got live migrated between two Power 10 systems
>    (using nested virtualization, i.e. KVM on top of PowerVM) will
>    very likely crash after about an hour.
>
> * At that point the live migration itself looked as if it had been
>    successful, but it wasn't, and the crash is caused by that.
>
> [ Test Plan ]
>
> * Set up two Power 10 systems (with firmware level FW1060 or newer,
>    which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
>
> * Set up a qemu/KVM environment that allows live migrating a KVM
>    guest from one P10 system to the other.
>
> * (The disk type does not seem to matter; NFS-based disk storage
>    can be used, for example.)
>
> * After about an hour the live-migrated guest is likely to crash.
>    Hence wait for 2 hours (which increases the likelihood) and
> a crash due to:
> "migrate_misplaced_folio+0x540/0x5d0"
> occurs.
>
> [ Where problems could occur ]
>
> * The 'fix' avoids calling folio_likely_mapped_shared() for cases where
>    the folio might already have been unmapped, and moving these checks
>    could, if done wrong, have an impact on page table locking,
>    which may lead to wrong locks, blocked memory and, finally, crashes.
>
> * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
>    indirected, which may lead to different behaviour and side effects.
>    However, isolation is still done, just slightly differently:
>    instead of using numamigrate_isolate_folio(), it now happens in
>    (the renamed) migrate_misplaced_folio_prepare().
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4b69@redhat.com
> https://lkml.kernel.org/r/20240626191129.658CFC32782@smtp.kernel.org
> https://lkml.kernel.org/r/20240620212935.656243-3-david@redhat.com
>
> * Changing a confusing return code to simply return 0 on success
>    clarifies the return code handling and usage, and was mainly done in
>    preparation for further changes,
>    but can have bad side effects if the old return code is still used
>    as-is in other places.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/20240620212935.656243-1-david@redhat.com
> https://lkml.kernel.org/r/20240620212935.656243-2-david@redhat.com
>
> * That NUMA balancing prohibits mTHP
>    (multi-size Transparent Hugepage Support) seems unreasonable,
>    since such a folio can be an exclusive mapping.
>    Allowing it brings significant performance improvements
>    (see the commit message of d2136d749d76), but introduced significant
>    changes to PTE mapping and modification, and even relies on further commits:
>    859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
>    80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
>    This can cause issues on systems configured for THP and
>    may confuse the ordering, which may even lead to memory corruption.
>    And this may especially hit (NUMA) systems with high core counts,
>    where balancing is needed more often.
>
> * Further upstream conversations:
> https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
> https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
>
> * The refactoring of the code for NUMA mapping rebuilding, moving
>    it into a new helper, seems straightforward, since the active code
>    stays unchanged; the new function needs to be callable, but this
>    is the case since it is all in mm/memory.c.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com
> https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com
>
> * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
> is more significant, since the logic changed from
> (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
> (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
> tables of more than one MM'.
>
> * Since this is an estimation, the results may be unpredictable
>    (especially for bigger folios), and may not match what is expected
>    or assumed (there are quite a few side notes in the code comments of
>    bb34f78d72c2 that mention potentially fuzzy results), hence this
>    may lead to unforeseen behaviour.
>
> * The condition statements became clearer since they are now based on
>    (more or less obvious) number counts, but can still be erroneous in
>    case the sharers estimation is incorrect.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com
> https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com
>
> * Commit 133d04b1eee9 extends commit bda420b98505 "numa balancing: migrate
> on fault among multiple bound nodes" from allowing NUMA fault migrations
> when the executing node is part of the policy mask for MPOL_BIND,
> to also support MPOL_PREFERRED_MANY policy.
> Both cases (MPOL_BIND and MPOL_PREFERRED_MANY) are treated in the same way.
>    In case the NUMA topology is not correctly considered, changes here
>    may lead to decreased memory performance.
>    However, the code changes themselves are relatively easy to trace.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
>
> * Finally, commit f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead
>    of cpu_to_node()") is part of a patch set to further optimize
>    cross-socket memory access with the MPOL_PREFERRED_MANY policy.
>    The mpol_misplaced() changes mainly replace cpu_to_node() with
>    numa_node_id(), and with that make the code more NUMA aware.
>    Based on that, a vm_fault/vmf needs to be passed in instead of a
>    vm_area_struct/vma.
>    This may have consequences for the memory policy itself.
>
> * Further upstream conversations:
> https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
> https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
>
> * The overall patch set touches quite a bit of common code,
> but the modifications were intensely discussed with many experts
> in the various mailing-list threads that are referenced above.
>
> [ Other Info ]
>
> * The first two "mm/migrate" commits are the newest and were
>    accepted upstream with kernel v6.11(-rc1);
>    all others have been upstream since v6.10(-rc1).
>
> * Hence oracular (with a planned target kernel of 6.11) is not affected,
>    and the SRU is for noble only.
>
> * And since (nested) KVM virtualization on ppc64el was (re-)introduced
>    just with noble, no Ubuntu release older than noble is affected.
>
> The following changes since commit 2f325e7ecae4d38c3f0a73d0dc06441ac9c27fd9:
>
> drm/amdgpu/pptable: Fix UBSAN array-index-out-of-bounds (2024-08-30 17:11:22 +0200)
>
> are available in the Git repository at:
>
> https://git.launchpad.net/~fheimes/+git/lp2076866/ 5d9eb99803ea783f1ef649aef6aca91d600bea6c
>
> for you to fetch changes up to 5d9eb99803ea783f1ef649aef6aca91d600bea6c:
>
> mm/migrate: move NUMA hinting fault folio isolation + checks under PTL (2024-09-05 10:34:38 +0200)
>
> ----------------------------------------------------------------
> Baolin Wang (2):
> mm: factor out the numa mapping rebuilding into a new helper
> mm: support multi-size THP numa balancing
>
> David Hildenbrand (3):
> mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
> mm/migrate: make migrate_misplaced_folio() return 0 on success
> mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
>
> Donet Tom (2):
> mm/mempolicy: use numa_node_id() instead of cpu_to_node()
> mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
>
> include/linux/mempolicy.h | 5 ++-
> include/linux/migrate.h | 7 ++++
> include/linux/mm.h | 48 +++++++++++++++++++-----
> mm/huge_memory.c | 17 +++++----
> mm/internal.h | 2 +-
> mm/madvise.c | 6 +--
> mm/memory.c | 95 +++++++++++++++++++++++++++++++++++------------
> mm/mempolicy.c | 50 ++++++++++++++++---------
> mm/migrate.c | 83 +++++++++++++++++++----------------------
> mm/mprotect.c | 3 +-
> 10 files changed, 207 insertions(+), 109 deletions(-)
>
Applied to noble:linux/master-next. Thanks.
-Stefan