ACK: [SRU][Focal][PATCH 0/1] kvm: Windows 2k19 with Hyper-v role gets stuck on pending hypervisor requests on cascadelake based kvm hosts

William Breathitt Gray william.gray at canonical.com
Fri Jan 22 04:13:23 UTC 2021


On Tue, Jan 19, 2021 at 12:02:31PM +1300, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1911848
> 
> [Impact]
> 
> On Cascade Lake based KVM hosts, Windows Server 2k16 and 2k19 guests fail to
> boot once the Hyper-V role has been enabled for nested virtualisation.
> 
> On Windows Server systems with the desktop environment installed, the guest
> gets stuck in the late stages of boot, before the graphical login screen
> appears.
> 
> If you look at performance metrics for the guest, CPU usage stays pinned at
> 100% and never drops. The Windows Server guest is unresponsive.
> 
> The KVM settings use Cascadelake-Server-noTSX virtual CPUs, with some very
> specific settings needed for nested virtualisation. See testcase section.
> If you use any other vcpu type, the problem does not reproduce.
> 
> The known workaround is to install the 5.8 HWE kernel, in which case the guest
> will come up as expected.
> 
> [Fix]
> 
> The following commit fixes the issue, and landed in mainline 5.8-rc1:
> 
> commit 8081ad06b68a728e676d3b08e9ab70ce4039747b
> Author: Sean Christopherson <seanjc at google.com>
> Date:   Wed Apr 22 19:25:40 2020 -0700
> Subject: KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
> Link: https://github.com/torvalds/linux/commit/8081ad06b68a728e676d3b08e9ab70ce4039747b
> 
> It appears that pending requests to the hypervisor can be lost or delayed if
> an immediate exit was requested in vcpu_enter_guest() and the run is then
> canceled. As the commit message mentions, only the !injected case is affected,
> so the fix adds a check at the cancel_injection label: if we got there after an
> immediate exit was requested, it re-issues a KVM_REQ_EVENT request.
> 
> The Windows guest is waiting for an event to be processed, which never happens,
> and so it gets stuck.
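
For illustration only, the behaviour described above can be sketched as a small
user-space toy model. This is not the kernel code and not the patch itself:
every toy_* name, the request bit, and the two-pass scenario are hypothetical,
chosen just to show why re-issuing the request on the cancel path matters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical, none is a kernel symbol. */
#define TOY_REQ_EVENT (1u << 0)

struct toy_vcpu {
    unsigned int requests;  /* pending request bits */
    bool event_pending;     /* an event is waiting to be delivered */
    bool event_delivered;   /* set once the event reaches the guest */
};

/* Test and clear a request bit. */
static bool toy_check_request(struct toy_vcpu *v, unsigned int req)
{
    if (!(v->requests & req))
        return false;
    v->requests &= ~req;
    return true;
}

/*
 * One pass through a simplified guest-entry path.
 *   blocked:    the event cannot be delivered on this pass, so an
 *               immediate exit is requested in order to retry later
 *   cancel_run: the entry is aborted after requests were processed
 *   with_fix:   model the re-request on the cancel path
 */
static void toy_enter_guest(struct toy_vcpu *v, bool blocked,
                            bool cancel_run, bool with_fix)
{
    bool req_immediate_exit = false;

    assert(v != NULL);

    /* The request bit is consumed here, whether or not delivery succeeds. */
    if (toy_check_request(v, TOY_REQ_EVENT) && v->event_pending) {
        if (blocked) {
            req_immediate_exit = true;
        } else {
            v->event_pending = false;
            v->event_delivered = true;
        }
    }

    if (cancel_run) {
        /* Modelled fix: the immediate exit never happens, so ask again. */
        if (with_fix && req_immediate_exit)
            v->requests |= TOY_REQ_EVENT;
        return;
    }

    /* Run not canceled: the forced exit leads to events being re-evaluated. */
    if (req_immediate_exit)
        v->requests |= TOY_REQ_EVENT;
}

/* Pass 1 is blocked and canceled; pass 2 could deliver the event. */
bool toy_event_delivered(bool with_fix)
{
    struct toy_vcpu v = {
        .requests = TOY_REQ_EVENT,
        .event_pending = true,
        .event_delivered = false,
    };

    toy_enter_guest(&v, true, true, with_fix);
    toy_enter_guest(&v, false, false, with_fix);
    return v.event_delivered;
}
```

In this model, toy_event_delivered(false) leaves the event pending forever
because the request bit was consumed on the canceled pass, while
toy_event_delivered(true) delivers it on the second pass.
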
> 
> Even though the above commit has a Fixes: tag to a commit in 3.15-rc1, in my
> testing the 4.15 kernel with a Bionic-ussuri userspace does not reproduce the
> issue, so SRU to Bionic will not be needed.
> 
> [Testcase]
> 
> A Cascade Lake based Xeon server is required. The bug will not reproduce on
> anything else.
> 
> I used a c5.metal server on AWS. It has the following processor:
> Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> 
> Install a KVM stack, and ubuntu-desktop. Set up xrdp and confirm you can reach
> the desktop. Copy a Windows Server 2k19 image to the destination server, as well
> as a recent ISO image of virtio drivers.
> 
> Install virt-manager.
> 
> Create a new virtual machine using the Windows 2k19 defaults. Use 8 vcpus and
> 16 GB of RAM. Click the customise button to change settings before install.
> 
> Change the hard disk to be SATA, and attach a new CD-ROM drive for the virtio
> drivers. Change networking to virtio. Change CPU to Cascadelake-Server-noTSX.
> 
> Edit the domain XML (for example with virsh edit), and ensure you set the
> following features for CPU:
> 
>   <cpu mode='custom' match='exact' check='full'>
>     <model fallback='forbid'>Cascadelake-Server-noTSX</model>
>     <topology sockets='8' cores='1' threads='1'/>
>     <feature policy='require' name='invpcid'/>
>     <feature policy='require' name='pcid'/>
>     <feature policy='require' name='vmx'/>
>     <feature policy='require' name='hypervisor'/>
>     <feature policy='disable' name='mpx'/>
>     <feature policy='require' name='pku'/>
>     <feature policy='require' name='arch-capabilities'/>
>     <feature policy='require' name='rdctl-no'/>
>     <feature policy='require' name='ibrs-all'/>
>     <feature policy='require' name='skip-l1dfl-vmentry'/>
>     <feature policy='require' name='mds-no'/>
>   </cpu>
> 
> Those settings are an absolute must.
> 
> Boot the VM, and install Windows 2k19 with the desktop environment. Once it is
> installed, open up Computer Management > Device Manager and install drivers from
> the virtio ISO for missing hardware, likely the network and balloon devices.
> 
> From there, go to Server Manager, and install the Hyper-V role.
> 
> Reboot the server. It will reboot a few times, and on the final time, it will
> lock up before it reaches the login screen.
> 
> In virt-manager, go to the performance tab. The CPU will be stuck at 100%.
> The Windows guest will be unresponsive.
> 
> A patched kernel is available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf296306-test
> 
> If you install this kernel and boot the Windows 2k19 guest, it will come up
> normally with the Hyper-V role enabled, and you will be able to log in.
> 
> [Where problems could occur]
> 
> This is a change to a core part of the kvm subsystem, so there is potential
> for regression which could affect all users of KVM.
> 
> If a regression were to occur, there are no workarounds. Users would need to
> downgrade their kernel while a fix is developed.
> 
> Sean Christopherson (1):
>   KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit
>     set
> 
>  arch/x86/kvm/x86.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> -- 
> 2.27.0
> 
> 

Acked-by: William Breathitt Gray <william.gray at canonical.com>