[Bug 2048517] Re: EPYC-Rome model without XSAVES may break live migration since the removal of the flag on the physical CPU

Mauricio Faria de Oliveira 2048517 at bugs.launchpad.net
Tue Aug 6 16:03:01 UTC 2024


The autopkgtest regression was due to a transient infrastructure issue,
and has cleared with a retry.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/2048517

Title:
  EPYC-Rome model without XSAVES may break live migration since the
  removal of the flag on the physical CPU

Status in nova package in Ubuntu:
  Fix Released
Status in qemu package in Ubuntu:
  Invalid
Status in nova source package in Focal:
  Fix Committed
Status in qemu source package in Focal:
  Invalid

Bug description:
  [ Impact ]

   * Live migration is increasingly being impacted by changes to CPU flags
     (e.g., 'xsaves' disabled on AMD EPYC; PKRU/'xsave' behavior changes),
     which prevent migration between otherwise identical hypervisors whose
     only difference is a CPU flag (i.e., the source hypervisor still has
     the flag enabled; the destination hypervisor had the flag disabled by
     a kernel update).
      
   * These CPU flag updates require changes to CPU model definitions in
     several places (qemu, libvirt, and nova if OpenStack is being used),
     which is a lot of overhead for each subtle variation that may appear.
     
   * Fortunately, it's possible to reduce the changes required by allowing
     nova to customize CPU flags to enable/disable _on top_ of a CPU model
     definition (e.g., the same AMD EPYC CPU model with 'xsaves' disabled).

   * This change is present in Jammy and later, and is backward compatible
     with the existing config files, as the (new) enable/disable operators 
     are an optional prefix to existing flags (e.g., '-xsaves' or '+xsaves').
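
  The optional prefix described above can be illustrated with a minimal
  sketch. This is not nova's actual parsing code: the function name
  split_cpu_flags is hypothetical, and the sketch assumes the flags are
  given as a comma-separated, case-insensitive list in which an
  unprefixed flag means "enable" (which is what keeps existing config
  files working).

```python
def split_cpu_flags(raw_flags):
    """Split a comma-separated flag list into (enabled, disabled) sets.

    Hypothetical helper, for illustration only.
    '+flag' or a bare 'flag' requests the flag; '-flag' disables it.
    """
    enabled, disabled = set(), set()
    for item in raw_flags.split(","):
        flag = item.strip().lower()
        if not flag:
            continue  # tolerate empty entries such as a trailing comma
        if flag.startswith("-"):
            disabled.add(flag[1:])
        elif flag.startswith("+"):
            enabled.add(flag[1:])
        else:
            # No prefix: treated as "enable", matching the older syntax.
            enabled.add(flag)
    return enabled, disabled
```

  Under these assumptions, split_cpu_flags("pcid, +ssbd, -xsaves") puts
  'pcid' and 'ssbd' in the enabled set and 'xsaves' in the disabled set.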

  [ Test Plan ]

   * Deploy OpenStack with 2 hypervisors (or more), and configure nova.conf
     with cpu_model and cpu_model_extra_flags (flags to disable/enable),
     for example:
     
     # grep cpu_model /etc/nova/nova.conf
     cpu_model = EPYC-Rome
     cpu_model_extra_flags = -xsaves

   * Start a VM before/after the package upgrade (focal-proposed), checking
     the VM XML for that flag (e.g., policy change from require to disable);
     for example:
     
     Before:
     
     # virsh dumpxml instance-<number> | grep xsaves
     <feature policy='require' name='xsaves'/>
     
     After:
     
     # virsh dumpxml instance-<number> | grep xsaves 
     <feature policy='disable' name='xsaves'/>
     
   * Ensure that nova is able to start *with* and *without* enable/disable
     CPU flag changes.
   
   * Ensure live migration works in both directions across the 2 hypervisors
     *with* and *without* enable/disable CPU flag changes.
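
  The XML check in the test plan could also be scripted. The sketch
  below is only an illustration: feature_policy and SAMPLE_XML are
  hypothetical names, the sample document is a trimmed-down stand-in
  for real 'virsh dumpxml' output, and the sketch assumes the <feature>
  elements sit directly under the domain's <cpu> element.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for 'virsh dumpxml' output (illustrative only).
SAMPLE_XML = """
<domain type='kvm'>
  <name>instance-example</name>
  <cpu mode='custom' match='exact'>
    <model fallback='forbid'>EPYC-Rome</model>
    <feature policy='disable' name='xsaves'/>
  </cpu>
</domain>
"""


def feature_policy(domain_xml, feature_name):
    """Return the policy of a CPU feature in a domain XML, or None."""
    root = ET.fromstring(domain_xml.strip())
    for feature in root.findall("./cpu/feature"):
        if feature.get("name") == feature_name:
            return feature.get("policy")
    return None  # the feature is not listed explicitly
```

  With the sample above, feature_policy(SAMPLE_XML, "xsaves") returns
  'disable', which corresponds to the "After" state in the test plan.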
     
     
  [ Regression Potential ]

   * Regressions would likely manifest in the areas modified by the patches,
     i.e., parsing the config file's cpu flags (on nova startup), generating
     a VM's XML file (on nova VM start/creation), and also live migration.

   * The patched packages have been evaluated/running in production for 2-3
     months now, and live migrations have been performed without any issues.
     
  [ Other Info ]

   * The code changes had their callee-paths reviewed, and no potential
     issues were identified.
     
   * The patches are already present in Jammy and later.

  [ Original Bug Description ]
  The upstream Linux kernel disabled XSAVES on AMD EPYC Rome CPUs ([1]). Upstream qemu shortly followed with a patch adding a CPU model version of EPYC-Rome without XSAVES ([2]).
  The change in the kernel has been backported to Ubuntu Focal ([3]).

  Without further workarounds or the adapted CPU model in qemu, this will lead to a situation where virtual machines with an EPYC-Rome CPU model created on hypervisors with newer EPYC CPUs will have the XSAVES flag enabled, thus preventing live migration to hypervisors with EPYC Rome CPUs where XSAVES is no longer available.
  Therefore I would like to argue that the patch adapting the CPU model in qemu should also be backported to Ubuntu Focal.
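
  As a rough way to spot such a mismatch between two hosts, the 'flags'
  line of /proc/cpuinfo from each hypervisor can be compared. The sketch
  below is an assumption-laden illustration, not part of any of the
  patches: the function names are hypothetical, and it only compares
  host-visible flags, which is not the same as what a given guest CPU
  model exposes.

```python
def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found


def flags_missing_on_destination(source_cpuinfo, destination_cpuinfo):
    """Flags present on the source host but absent on the destination."""
    return cpu_flags(source_cpuinfo) - cpu_flags(destination_cpuinfo)
```

  If the source host still reports 'xsaves' and the destination host no
  longer does, 'xsaves' shows up in the returned set.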

  [1]
  https://lore.kernel.org/all/20230307174643.1240184-1-andrew.cooper3@citrix.com/

  [2]
  https://patchew.org/QEMU/20230524213748.8918-1-davydov-max@yandex-team.ru/

  [3]
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2023420

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nova/+bug/2048517/+subscriptions



