ACK/Cmnt: [SRU][B][F][PATCH 0/1] KVM: Fix zero_page reference counter overflow when using KSM on KVM compute host
Stefan Bader
stefan.bader at canonical.com
Tue Aug 18 08:08:09 UTC 2020
On 17.08.20 01:51, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1837810
>
> [Impact]
>
> We are seeing a problem on OpenStack compute nodes, and KVM hosts, where a kernel
> oops is generated, and all running KVM machines are placed into the pause state.
>
> This is caused by the kernel's reserved zero_page reference counter overflowing
> from a positive number to a negative number, and hitting a
> (WARN_ON_ONCE(page_ref_count(page) <= 0)) condition in try_get_page().
>
> This only happens if the machine has Kernel Samepage Merging (KSM) enabled,
> with "use_zero_pages" turned on. Each time a new VM starts and the kernel does
> a KSM merge run during an EPT violation, the reference counter for the zero_page
> is incremented in try_async_pf() and never decremented. Eventually, the reference
> counter will overflow, causing the KVM subsystem to fail.
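
To make the wrap-around concrete, here is a minimal user-space sketch of a
32-bit signed reference counter guarded by a check like the one quoted from
try_get_page(). The names page_model, get_ref() and try_get_page_model() are
hypothetical stand-ins for illustration, not the kernel's types or functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a page with a 32-bit signed reference counter. */
struct page_model {
    int32_t refcount;
};

/*
 * Increment with explicit wrap-around. The addition is done on uint32_t
 * because plain signed overflow is undefined behaviour in ISO C; the
 * conversion back to int32_t is implementation-defined and wraps on the
 * usual two's-complement compilers.
 */
static void get_ref(struct page_model *p)
{
    p->refcount = (int32_t)((uint32_t)p->refcount + 1u);
}

/*
 * Models the guard described above: refuse to take a reference once the
 * counter is zero or negative (the kernel would also emit a warning).
 */
static bool try_get_page_model(struct page_model *p)
{
    if (p->refcount <= 0)
        return false;
    get_ref(p);
    return true;
}
```

Starting from refcount = INT32_MAX - 1, the first call succeeds and leaves
INT32_MAX, the second succeeds but wraps the counter to INT32_MIN
(0x80000000), and every later call is refused because the value is negative.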
>
> Syslog:
> error : qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required
>
> QEMU Logs:
> error: kvm run failed Bad address
> EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
> ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
> EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
> SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
> GDT= 000f7040 00000037
> IDT= 000f707e 00000000
> CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31
>
> Kernel Oops:
>
> [ 167.695986] WARNING: CPU: 1 PID: 3016 at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 follow_page_pte+0x6f4/0x710
> [ 167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G OE 4.15.0-106-generic #107~16.04.1-Ubuntu
> [ 167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
> [ 167.696025] RIP: 0010:follow_page_pte+0x6f4/0x710
> [ 167.696026] RSP: 0018:ffffa81802023908 EFLAGS: 00010286
> [ 167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 0000000080000000
> [ 167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 80000001b8cea225
> [ 167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: ffff90c4d55fa340
> [ 167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffed8786e33a80
> [ 167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: ffff90c4d55fa340
> [ 167.696030] FS: 00007f6a7798c700(0000) GS:ffff90c4edc80000(0000) knlGS:0000000000000000
> [ 167.696030] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 0000000000162ee0
> [ 167.696033] Call Trace:
> [ 167.696047] follow_pmd_mask+0x273/0x630
> [ 167.696049] follow_page_mask+0x178/0x230
> [ 167.696051] __get_user_pages+0xb8/0x740
> [ 167.696052] get_user_pages+0x42/0x50
> [ 167.696068] __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
> [ 167.696079] ? mmu_set_spte+0x1dd/0x3a0 [kvm]
> [ 167.696090] try_async_pf+0x66/0x220 [kvm]
> [ 167.696101] tdp_page_fault+0x14b/0x2b0 [kvm]
> [ 167.696104] ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
> [ 167.696114] kvm_mmu_page_fault+0x62/0x180 [kvm]
> [ 167.696117] handle_ept_violation+0xbc/0x160 [kvm_intel]
> [ 167.696119] vmx_handle_exit+0xa5/0x580 [kvm_intel]
> [ 167.696129] vcpu_enter_guest+0x414/0x1260 [kvm]
> [ 167.696138] ? kvm_arch_vcpu_load+0x4d/0x280 [kvm]
> [ 167.696148] kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
> [ 167.696157] ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
> [ 167.696165] kvm_vcpu_ioctl+0x33a/0x610 [kvm]
> [ 167.696166] ? do_futex+0x129/0x590
> [ 167.696171] ? __switch_to+0x34c/0x4e0
> [ 167.696174] ? __switch_to_asm+0x35/0x70
> [ 167.696176] do_vfs_ioctl+0xa4/0x600
> [ 167.696177] SyS_ioctl+0x79/0x90
> [ 167.696180] ? exit_to_usermode_loop+0xa5/0xd0
> [ 167.696181] do_syscall_64+0x73/0x130
> [ 167.696182] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [ 167.696184] RIP: 0033:0x7f6a80482007
> [ 167.696184] RSP: 002b:00007f6a7798b8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f6a80482007
> [ 167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
> [ 167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 0000000000000001
> [ 167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 000055fe135f3240
> [ 167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f
> [ 167.696200] ---[ end trace 7573f6868ea8f069 ]---
>
> [Fix]
>
> This was fixed in 5.6-rc1 with the following commit:
>
> commit 7df003c85218b5f5b10a7f6418208f31e813f38f
> Author: Zhuang Yanying <ann.zhuangyanying at huawei.com>
> Date: Sat Oct 12 11:37:31 2019 +0800
> Subject: KVM: fix overflow of zero page refcount with ksm running
> Link: https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f
>
> The fix adds a check to see if the Page Frame Number (pfn) is linked to the zero
> page, and if it is, no longer treats it as reserved. This has the effect that
> KVM now calls put_page() on the zero_page when it releases the pfn, so each
> reference taken in try_async_pf() is dropped and the counter no longer grows.
>
> This is a clean cherry pick to Bionic and Focal kernels.
>
> [Testcase]
>
> Create a new KVM host, and make sure it has plenty of RAM. 16 GB should be okay.
>
> Install KVM packages:
>
> $ sudo apt install -y qemu-kvm libvirt-bin qemu-utils genisoimage virtinst
>
> Enable Kernel Samepage Merging, and use_zero_pages:
>
> $ echo 10000 | sudo tee /sys/kernel/mm/ksm/pages_to_scan
> $ echo 1 | sudo tee /sys/kernel/mm/ksm/run
> $ echo 1 | sudo tee /sys/kernel/mm/ksm/use_zero_pages
>
> I wrote a script which creates and destroys xenial KVM VMs in an infinite loop:
> https://paste.ubuntu.com/p/CvRTsDkdC7/
>
> Save the script to disk, and execute it:
>
> $ chmod +x ksm_refcnt_overflow.sh
> $ ./ksm_refcnt_overflow.sh
>
> Each time a VM is created and destroyed the reference counter will increase.
>
> I wrote a kernel module which exposes a /proc interface, which we can use to
> look at the value of the zero_page reference counter. It works by taking the
> memory allocated for the zero page: empty_zero_page, which is defined in
> arch/x86/include/asm/pgtable.h, running virt_to_page() to get the page struct,
> which we can then dereference to get _refcount:
>
> https://paste.ubuntu.com/p/MJMN8jMVds/
>
> Save the module to disk, create its Makefile from the included documentation,
> and build it:
>
> $ make
> $ sudo insmod zero_page_refcount.ko
>
> From there, we can examine the reference counter with:
>
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x687 or 1671
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x846 or 2118
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x9f8 or 2552
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0xcb2 or 3250
>
> We see it steadily increase. Instead of waiting months for it to overflow, I
> implemented a /proc entry to set it to near overflow. You can use it with:
>
> $ cat /proc/zero_page_refcount_set
> Zero Page Refcount set to 0x1FFFFFFFFF000
>
> After that, wait a few seconds and the reference counter will overflow:
>
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x7fffff16 or 2147483414
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x80000000 or -2147483648
>
> All VMs will become paused:
>
> $ virsh list
> Id Name State
> ----------------------------------------------------
> 1 instance-0 paused
> 2 instance-1 paused
>
> QEMU will error out, and the kernel will oops with the messages in the impact
> section.
>
> I built a test kernel, which is available here:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf290373-test
>
> If you install the test kernel and try to reproduce, you will notice the reference
> counter is never incremented past 1:
>
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x1 or 1
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x1 or 1
> $ cat /proc/zero_page_refcount
> Zero Page Refcount: 0x1 or 1
>
> This resolves the problem.
>
> [Regression Potential]
>
> While the change itself seems simple, it changes how the kernel treats the
> zero_page. The zero_page is important, since it is just a page full of 0's.
> Each time memory that is all 0s is allocated, the kernel maps it to the
> zero_page to save memory. When an application writes to the buffer, an EPT
> violation happens, and the kernel does a COW to new pages to hold the data.
>
> The change is limited to how the KVM subsystem handles the zero_page.
> This will not break the entire kernel if a regression occurs, only KVM.
>
> If a regression were to occur, users could turn off KSM and disable KSM
> use_zero_pages until a fix is ready, as this particular use of zero_pages is
> limited to KSM.
>
> The fix landed in upstream 5.6, and has not been backported to stable kernels.
>
> I have read a bit of the paging code, especially around where the zero_page is
> used, and where its reference counters were being incorrectly incremented.
> I think the fix is correct, and I believe it won't cause any regressions.
>
> Zhuang Yanying (1):
> KVM: fix overflow of zero page refcount with ksm running
>
> virt/kvm/kvm_main.c | 1 +
> 1 file changed, 1 insertion(+)
>
Ok, the change stops treating zero pages as reserved in kvm_is_reserved_pfn().
It is not visible from the patch, but I assume this keeps the zero page
reference count from growing there. Any regression would occur when using KSM,
so we need to be careful to get verification explicitly done.
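
Since the effect is not visible from the one-line diff, here is a rough
user-space model of it. It assumes, as the upstream commit message describes,
that KVM only calls put_page() when releasing a pfn that
kvm_is_reserved_pfn() reports as not reserved. All names below (pfn_model,
is_reserved_before, is_reserved_after, release_pfn, refcount_after_faults)
are hypothetical and only mimic the kernel logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the page behind a pfn. */
struct pfn_model {
    bool reserved;     /* stands in for PageReserved() */
    bool is_zero;      /* stands in for is_zero_pfn() */
    int32_t refcount;
};

typedef bool (*reserved_fn)(const struct pfn_model *);

/* Before the fix: every PageReserved page is reserved for KVM. */
static bool is_reserved_before(const struct pfn_model *p)
{
    return p->reserved;
}

/* After the fix: the zero page is excluded from the reserved set. */
static bool is_reserved_after(const struct pfn_model *p)
{
    return p->reserved && !p->is_zero;
}

/* Models the release path: only non-reserved pages drop a reference. */
static void release_pfn(struct pfn_model *p, reserved_fn is_reserved)
{
    if (!is_reserved(p))
        p->refcount--;
}

/* Each modelled fault takes one reference, then releases the pfn. */
static int32_t refcount_after_faults(reserved_fn is_reserved, int faults)
{
    struct pfn_model zero = { .reserved = true, .is_zero = true, .refcount = 1 };

    for (int i = 0; i < faults; i++) {
        zero.refcount++;                  /* reference taken on the fault path */
        release_pfn(&zero, is_reserved);  /* dropped only if not reserved */
    }
    return zero.refcount;
}
```

In this model, 1000 faults with is_reserved_before leave the counter at 1001,
while is_reserved_after keeps it at 1, which matches the constant value of 1
reported for the test kernel.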
Acked-by: Stefan Bader <stefan.bader at canonical.com>