[Bug 2048517] Re: EPYC-Rome model without XSAVES may break live migration since the removal of the flag on the physical CPU
Andreas Hasenack
2048517 at bugs.launchpad.net
Thu Aug 22 15:00:28 UTC 2024
I released this by mistake, not having seen that the verification tag
wasn't clear yet. Apologies.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/2048517
Title:
EPYC-Rome model without XSAVES may break live migration since the
removal of the flag on the physical CPU
Status in nova package in Ubuntu:
Fix Released
Status in qemu package in Ubuntu:
Invalid
Status in nova source package in Focal:
Fix Released
Status in qemu source package in Focal:
Invalid
Bug description:
[ Impact ]
* Live migration is increasingly being impacted by changes to CPU flags
(e.g., 'xsaves' disabled on AMD EPYC; PKRU/'xsave' behavior changes),
which prevent migration between otherwise identical hypervisors whose
only difference is a CPU flag (i.e., the source hypervisor still has the
flag enabled; the destination hypervisor had it disabled by a kernel update).
* These CPU flag updates require changes to CPU model definitions in
several places (qemu, libvirt, and nova if openstack is being used),
which is a lot of overhead for each subtle variation that may appear.
* Fortunately, it's possible to reduce the changes required by allowing
nova to customize CPU flags to enable/disable _on top_ of a CPU model
definition (e.g., the same AMD EPYC CPU model with 'xsaves' disabled).
* This change is present in Jammy and later, and is backward compatible
with the existing config files, as the (new) enable/disable operators
are an optional prefix to existing flags (e.g., '-xsaves' or '+xsaves').
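
The optional prefix described above can be illustrated with a small
sketch. This is not nova's actual parsing code; the function name
parse_extra_flags is hypothetical, and the sketch assumes the value is a
comma-separated list in which a bare flag means "enable" (the existing
behavior) and flags are compared in lowercase.

```python
def parse_extra_flags(raw):
    """Split a comma-separated flag list into (enabled, disabled) sets.

    Hypothetical helper for illustration only. A bare 'flag' or '+flag'
    is treated as enable; '-flag' is treated as disable.
    """
    enabled, disabled = set(), set()
    for item in raw.split(","):
        flag = item.strip().lower()  # assumption: case-insensitive flags
        if not flag:
            continue  # skip empty entries such as trailing commas
        if flag.startswith("-"):
            disabled.add(flag[1:])
        elif flag.startswith("+"):
            enabled.add(flag[1:])
        else:
            # No prefix: same meaning as before the change (enable).
            enabled.add(flag)
    return enabled, disabled


enabled, disabled = parse_extra_flags("pcid, +ssbd, -xsaves")
print(sorted(enabled), sorted(disabled))
```

Because an unprefixed flag keeps its old meaning, a config file written
before the change parses the same way afterwards.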
[ Test Plan ]
* Deploy OpenStack with 2 hypervisors (or more), and configure nova.conf
with a cpu_model and cpu_model_extra_flags to disable/enable, for example:
# grep cpu_model /etc/nova/nova.conf
cpu_model = EPYC-Rome
cpu_model_extra_flags = -xsaves
* Start a VM before/after the package upgrade (focal-proposed), checking
the VM XML for that flag (e.g., policy change from require to disable);
for example:
Before:
# virsh dumpxml instance-<number> | grep xsaves
<feature policy='require' name='xsaves'/>
After:
# virsh dumpxml instance-<number> | grep xsaves
<feature policy='disable' name='xsaves'/>
* Ensure that nova is able to start *with* and *without* enable/disable
cpu flag changes.
* Ensure live migration works in both directions across the 2 hypervisors
*with* and *without* enable/disable cpu flag changes.
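
The XML check in the test plan could also be scripted rather than
grepped. The following is a minimal sketch, assuming the domain XML has
the <cpu>/<feature> layout shown in the examples above; the function
name feature_policy and the sample XML are hand-written for
illustration and are not real output from a deployment.

```python
import xml.etree.ElementTree as ET


def feature_policy(domain_xml, name):
    """Return the policy of the named <cpu>/<feature>, or None if absent."""
    root = ET.fromstring(domain_xml.strip())
    for feature in root.findall("./cpu/feature"):
        if feature.get("name") == name:
            return feature.get("policy")
    return None


# Hand-written, heavily trimmed sample (assumption, not real virsh output).
SAMPLE = """
<domain type='kvm'>
  <cpu mode='custom' match='exact'>
    <model>EPYC-Rome</model>
    <feature policy='disable' name='xsaves'/>
  </cpu>
</domain>
"""

print(feature_policy(SAMPLE, "xsaves"))
```

In practice the input would be the text printed by
"virsh dumpxml instance-<number>", and the expected value is 'require'
before the upgrade and 'disable' after it.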
[ Regression Potential ]
* Regressions would likely manifest in the areas modified by the patches,
i.e., parsing the config file's cpu flags (on nova startup), generating
a VM's XML file (on nova VM start/creation), and also live migration.
* The patched packages have been under evaluation and running in production
for 2-3 months now, and live migrations have been performed without any issues.
[ Other Info ]
* The code changes had their callee-paths reviewed, and potential issues
were not identified.
* The patches are already present in Jammy and later.
[ Original Bug Description ]
The upstream Linux kernel disabled XSAVES on AMD EPYC Rome CPUs ([1]). Upstream qemu shortly followed with a patch adding a CPU model version of EPYC-Rome without XSAVES ([2]).
The change in the kernel has been backported to Ubuntu Focal ([3]).
Without further workarounds or the adapted CPU model in qemu, this leads to a situation where virtual machines with an EPYC-Rome CPU model created on hypervisors with newer EPYC CPUs have the XSAVES flag enabled, which prevents live migration to hypervisors with EPYC Rome CPUs where XSAVES is no longer available.
Therefore I would like to argue that the patch adapting the CPU model in qemu should also be backported to Ubuntu Focal.
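
The failure mode described above comes down to a set difference: a flag
the guest was started with is missing on the destination host. The
sketch below illustrates only that idea; the real compatibility checks
are performed by libvirt and nova and are more involved. The function
names and the abbreviated sample "flags" line are hypothetical, and the
sketch assumes the destination kernel no longer lists 'xsaves' in
/proc/cpuinfo.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_destination(guest_flags, dest_cpuinfo):
    """Flags the guest requires that the destination host does not offer."""
    return set(guest_flags) - cpu_flags(dest_cpuinfo)


# Abbreviated, hand-written sample (assumption): 'xsaves' is absent.
DEST_CPUINFO = "flags\t\t: fpu vme xsave xsaveopt xsavec xgetbv1\n"

print(missing_on_destination({"xsave", "xsaves"}, DEST_CPUINFO))
```

A non-empty result corresponds to the case in this report: the guest
was created with 'xsaves' and the destination can no longer provide it.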
[1] https://lore.kernel.org/all/20230307174643.1240184-1-andrew.cooper3@citrix.com/
[2] https://patchew.org/QEMU/20230524213748.8918-1-davydov-max@yandex-team.ru/
[3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2023420
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nova/+bug/2048517/+subscriptions