NAK: [SRU][Yakkety][PATCH 0/5] Fixes for LP:#1680513

Stefan Bader stefan.bader at canonical.com
Fri Jul 21 07:48:40 UTC 2017


On 18.07.2017 14:05, Gavin Guo wrote:
> BugLink: http://bugs.launchpad.net/bugs/1680513
> 
> [Impact]
> After numad is enabled and there are several VMs running on the same
> host machine (host kernel version: 4.4.0-72-generic #93), soft lockup
> messages can be observed in the VMs' dmesg.
> 
> First, a crash dump was captured when the symptom was observed. At
> first glance it looked like a lost-IPI issue. The numad process
> initiates a migration of memory and, as part of this, needs to flush
> the TLB cache of another CPU. When the crash dump was taken, that
> other CPU had the TLB flush pending, but not yet executed.
> 
> The numad kernel task is holding the mmap_sem semaphore (for the
> VM's memory) to do the migration, and the tasks that actually end up
> being blocked are other virtual CPUs of the same VM. These tasks need
> to access or make changes to the memory map of the VM because of VM
> page faults, but cannot acquire the semaphore.
> 
> However, the original thoughts on the root cause (an unhandled IPI or
> a csd lock issue) turned out to be incorrect.
> 
> We originally suspected an issue with a lost IPI (inter processor
> interrupt) that performs remote CPU cache flushes during page
> migration, or a known issue with the "csd" lock used to synchronize
> the remote CPU cache flush. A lost IPI would be a function of the
> system firmware or chipset (it is not a CPU issue), but the known csd
> issue is hardware independent.
> 
> Gavin created a hotfix kernel with changes to the csd_lock_wait
> function so that it times out if the unlock never happens (the end
> result of either cause) and prints messages to the console when that
> timeout occurs. The messages look like the following (a sketch of the
> timeout check itself follows the two formats below):
> 
> csd_lock_wait called %d times
> 
> csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d
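> 
> For illustration, a minimal sketch of the shape of that diagnostic
> (not the real hotfix; CSD_LOCK_TIMEOUT_NS and the exact message
> wording are placeholders):
> 
> static void csd_lock_wait_debug(struct call_single_data *csd)
> {
>         u64 start = sched_clock();
> 
>         /* Spin, as the stock csd_lock_wait() does, until the remote
>          * CPU clears CSD_FLAG_LOCK after running the IPI handler. */
>         while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK) {
>                 u64 now = sched_clock();
> 
>                 /* If the unlock never arrives, warn instead of
>                  * spinning silently forever. */
>                 if (now - start > CSD_LOCK_TIMEOUT_NS) {
>                         pr_warn("csd: lock not released for %llu ns\n",
>                                 (unsigned long long)(now - start));
>                         start = now;    /* re-arm, warn periodically */
>                 }
>                 cpu_relax();
>         }
> }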
> 
> However, the VMs were still experiencing the hangs while the
> csd_lock_wait timeout was not firing. This suggests that the csd
> lock / lost IPI is not the actual cause.
> 
> In the crash dump, the numad task has induced a migration, and the
> stack is as follows:
> 
> #1 [ffff885f8fb4fb78] smp_call_function_many
> #2 [ffff885f8fb4fbc0] native_flush_tlb_others
> #3 [ffff885f8fb4fc08] flush_tlb_page
> #4 [ffff885f8fb4fc30] ptep_clear_flush
> #5 [ffff885f8fb4fc60] try_to_unmap_one
> #6 [ffff885f8fb4fcd0] rmap_walk_ksm
> #7 [ffff885f8fb4fd28] rmap_walk
> #8 [ffff885f8fb4fd80] try_to_unmap
> #9 [ffff885f8fb4fdc8] migrate_pages
> #10 [ffff885f8fb4fe80] do_migrate_pages
> 
> Frame #1 is actually inside the csd_lock_wait function mentioned
> above, but the compiler has optimized that call, so it does not
> appear as a separate frame in the stack.
> 
> What happens here is that do_migrate_pages (frame #10) acquires the
> semaphore that everything else is waiting for (and that eventually
> produces the hang warnings), and it holds that semaphore for the
> duration of the page migration. This strongly suggests that this
> single do_migrate_pages call is taking in excess of 10 seconds, and
> that, if the csd lock is not stuck, something else within its call
> path is not functioning correctly.
> 
> We originally suspected that the lost IPI/csd lock hang was
> responsible for the hung task timeouts, but in the absence of the csd
> warning messages, the cause presumably lies elsewhere.
> 
> A KSM function appears in frame #6; this is the function that
> searches out the merged pages so they can be handled for the
> migration.
> 
> Gavin disassembled the code and finally found that the
> stable_node->hlist is 2,306,920 entries long:
> 
> rmap_item list (stable_node->hlist):
> stable_node: 0xffff881f836ba000 stable_node->hlist->first = 0xffff883f3e5746b0
> 
> struct hlist_head {
>     [0] struct hlist_node *first;
> }
> struct hlist_node {
>     [0] struct hlist_node *next;
>     [8] struct hlist_node **pprev;
> }
> 
> crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
> 
> $ wc -l rmap_item.lst
> 2306920 rmap_item.lst
> 
> This is roughly 9 GB of pages. The theory is that KSM has merged a
> very large number of pages that are empty (every location in the
> page is zero).
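> 
> As a sanity check on that figure, assuming the standard 4 KiB page
> size (an assumption here, not something read out of the dump):
> 
> 2,306,920 pages x 4,096 bytes/page = 9,449,144,320 bytes
>                                    ~ 9.4 GB (about 8.8 GiB)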
> 
> The bug can also be observed in the perf flame graph[1]:
> 
> [1]. http://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg
> 
> [Fix]
> Andrea Arcangeli already sent out the patch[2] on 2015/11/10, and
> Andrew Morton said he would apply it. However, the patch eventually
> disappeared from the mmotm tree in April 2016. Andrea suggested
> applying the 3 patches[3]; a short illustration of the resulting
> tunable follows the links below.
> 
> [2]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page
> deduplication limit
> http://www.spinics.net/lists/linux-mm/msg96866.html
> 
> [3]. Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page
> deduplication limit
> https://www.spinics.net/lists/linux-mm/msg113829.html
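> 
> The series exposes the new limit as a sysfs tunable. A quick,
> illustrative way to check whether a running kernel carries it (the
> path /sys/kernel/mm/ksm/max_page_sharing and its default of 256 are
> taken from the patch description, so treat them as assumptions):
> 
> #include <stdio.h>
> 
> int main(void)
> {
>         char buf[32];
>         FILE *f = fopen("/sys/kernel/mm/ksm/max_page_sharing", "r");
> 
>         if (!f) {
>                 /* Missing file: this kernel does not carry the KSM
>                  * deduplication-limit patches. */
>                 perror("max_page_sharing");
>                 return 1;
>         }
>         if (fgets(buf, sizeof(buf), f))
>                 printf("max_page_sharing = %s", buf);
>         fclose(f);
>         return 0;
> }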
> 
> [Test Case]
> The patches have been tested with 9 VMs, each with 32 GB of RAM and
> 16 VCPUs. Numad and KSM are also enabled on the machine. After
> running for 6 days, the system is stable: no erratic CPU loading can
> be observed in the virtual appliance monitor[4], the numad CPU
> utilization is normal, and no guest hangs have been observed.
> 
> Machine type: Dell PowerEdge R920
> Memory: 528 GB with 4 NUMA nodes
> CPU: 120 cores
> 
> [4]. http://kernel.ubuntu.com/~gavinguo/sf00131845/virtual_appliances_loading.png
> 
> Andrea Arcangeli (5):
>   ksm: introduce ksm_max_page_sharing per page deduplication limit
>   ksm: fix use after free with merge_across_nodes = 0
>   ksm: cleanup stable_node chain collapse case
>   ksm: swap the two output parameters of chain/chain_prune
>   ksm: optimize refile of stable_node_dup at the head of the chain
> 
>  Documentation/vm/ksm.txt |  63 ++++
>  mm/ksm.c                 | 820 +++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 817 insertions(+), 66 deletions(-)
> 
Not considering this for Yakkety as that series has reached EOL.

-Stefan
