[Bug 2041519] Related fix proposed to nova (stable/2023.2)
OpenStack Infra
2041519@bugs.launchpad.net
Tue Dec 17 09:13:22 UTC 2024
Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/937847
https://bugs.launchpad.net/bugs/2041519
Title:
Inventories of SR-IOV GPU VFs are impacted by allocations for other
VFs
Status in Ubuntu Cloud Archive:
New
Status in OpenStack Compute (nova):
Fix Released
Status in nova package in Ubuntu:
New
Status in nova source package in Jammy:
New
Status in nova source package in Noble:
New
Status in nova source package in Oracular:
New
Bug description:
It is hard to summarize this problem in a bug report title, my bad.
Long story short, the case arises if you start using NVIDIA SR-IOV next-gen GPUs like the A100, which create Virtual Functions on the host, each of them supporting the same GPU types but with the number of available mediated devices that can be created on it equal to 1.
If you're using other GPUs (like the V100) and you're not running NVIDIA's sriov-manage to expose the VFs, please never mind this bug; you should not be impacted.
So, say you have an A100 GPU card: before configuring Nova, you have to
run the aforementioned sriov-manage script, which will allocate 16
virtual functions for the GPU. Each of those PCI addresses will
correspond to a Placement resource provider (if you configure Nova so)
with a VGPU inventory of total=1.
Example:
https://paste.opendev.org/show/bVxrVLW3yOR3TPV2Lz3A/
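For reference, here is a minimal sketch of the kind of setup described above (the sriov-manage path, the mdev type and the PCI addresses are illustrative and may differ on your system):
# Expose the 16 virtual functions of the physical GPU
$ sudo /usr/lib/nvidia/sriov-manage -e ALL
# nova.conf on the compute node: one resource provider per VF PCI address
[devices]
enabled_mdev_types = nvidia-472
[mdev_nvidia-472]
# list all 16 VF PCI addresses here
device_addresses = 0000:41:00.4,0000:41:00.5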
Sysfs shows the exact same thing for the nvidia-472 type I configured:
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Now, the problem arises when you exhaust the number of mediated devices you can create.
In the case of nvidia-472, which corresponds to NVIDIA's GRID A100-20C profile, you can create up to 2 VGPUs, i.e. mediated devices, per card.
Accordingly, once Nova has automatically created those 2 mediated
devices while booting instances (because no already-created mediated
devices were found available), *all other* VFs that don't hold those 2
mediated devices end up with an available_instances value equal to 0:
[stack@lenovo-sr655-01 nova]$ openstack server create --image cirros-0.6.2-x86_64-disk --flavor c1g --key-name mykey --network public vm1
(skipped)
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
[stack@lenovo-sr655-01 nova]$ openstack server create --image cirros-0.6.2-x86_64-disk --flavor c1g --key-name mykey --network public vm2
(skipped)
[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/*/mdev_supported_types/nvidia-472/available_instances
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Now, when we look at the inventories for all VFs, we see that while it's normal for the 2 Resource Providers whose mediated devices were created to have their total at 1 and their usage at 1 (since we created an mdev, it's counted), it is not normal to see the *other VFs* keep a total of 1 with a usage of 0:
[stack@lenovo-sr655-01 nova]$ for uuid in $(openstack resource provider list -f value -c uuid); do openstack resource provider inventory list $uuid -f value -c resource_class -c total -c used; done | grep VGPU
VGPU 1 1
VGPU 1 1
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
VGPU 1 0
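For comparison, given the libvirt-side computation described just below (where every unused VF ends up with total=0), one would rather expect only the 2 used VFs to expose a usable VGPU inventory, roughly:
VGPU 1 1
VGPU 1 1
(and the 14 other VFs reporting no allocatable VGPU at all)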
I eventually went down into the code and found the culprit:
https://github.com/openstack/nova/blob/9c9cd3d9b6d1d1e6f62012cd8a86fd588fb74dc2/nova/virt/libvirt/driver.py#L9110-L9111
Before this method is called, we correctly calculate the numbers that
we get from libvirt, and all the unused VFs have their total=0, but
since we enter this conditional, we skip updating them.
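To illustrate the pattern described above, here is a simplified sketch (not the actual nova code; the names are made up for the example):
def _update_vgpu_inventories(provider_tree, inventories_by_rp):
    # inventories_by_rp: {rp_name: {'total': ..., 'used': ...}} as computed
    # from libvirt just before this point.
    for rp_name, inventory in inventories_by_rp.items():
        if inventory['total'] == 0:
            # Problem: skipping here leaves the stale total=1 VGPU inventory
            # of this VF untouched in the provider tree, hence in Placement.
            continue
        provider_tree.update_inventory(rp_name, {'VGPU': inventory})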
There are different ways to solve this problem:
- we stop automatically creating mediated devices and ask operators to pre-allocate all mediated devices before starting nova-compute, but this has a big operator impact (and they would need to add some tooling)
- we blindly remove the RP from the PlacementTree and let the update_resource_providers() call in the compute manager try to update Placement with this new view (see the sketch after this list). In that very particular case, we're sure that none of the RPs that have total=0 have allocations against them, so it shouldn't fail, but this logic can be error-prone if we try to reproduce it elsewhere.
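A minimal sketch of what that second option could look like, assuming nova's ProviderTree interface (an illustration of the idea, not the merged fix):
def _prune_unusable_vgpu_providers(provider_tree, inventories_by_rp):
    # A VF whose recomputed total dropped to 0 cannot hold a mediated device,
    # so in this particular case no VGPU allocation can exist against it and
    # removing the resource provider from the tree is safe.
    for rp_name, inventory in inventories_by_rp.items():
        if inventory['total'] == 0 and provider_tree.exists(rp_name):
            provider_tree.remove(rp_name)
The compute manager would then report this new view to Placement, which should succeed here precisely because those providers carry no allocations.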
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2041519/+subscriptions