[Bug 1972028] [NEW] [SRU] _get_pci_passthrough_devices prone to race condition

Launchpad Bug Tracker 1972028 at bugs.launchpad.net
Tue Jan 7 16:29:26 UTC 2025


You have been subscribed to a public bug by Ubuntu Foundations Team Bug Bot (crichton):

[Impact]

Nova suffers from a race condition when it live migrates VMs
with SRIOV ports: a pre-check of available ports and their
capabilities can fail if one or more ports become unavailable during
the check. The fix backported here ignores libvirt errors raised while
checking device capabilities, so any device that throws an error
is skipped.

[Test Plan]

Since the bug is a race condition it can be hard to reproduce, but a
succession of live migrations between SRIOV capable nodes with a
reasonably large number of VFs should be an adequate test.

* deploy OpenStack Yoga with SRIOV capable hardware
* create 10 VMs with e.g. 5 SRIOV ports
* live migrate the VMs between the hosts and check for the Traceback in /var/log/nova/nova-compute.log
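
The log check in the last step could be scripted along these lines. This is only a sketch: the helper names `find_race_errors` and `scan_log` are made up for illustration, and the pattern is taken from the error message in the traceback quoted further down, so it may need adjusting for other log formats.

```python
import re
from typing import Iterable, List

# Pattern copied from the error line in the traceback in this report.
RACE_SIGNATURE = re.compile(r"libvirt\.libvirtError: Node device not found")


def find_race_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that carry the 'Node device not found' error."""
    return [line.rstrip("\n") for line in lines if RACE_SIGNATURE.search(line)]


def scan_log(path: str = "/var/log/nova/nova-compute.log") -> List[str]:
    """Scan a nova-compute log file (hypothetical helper, path from the test plan)."""
    with open(path, errors="replace") as fh:
        return find_race_errors(fh)
```

An empty result after a run of live migrations suggests the race was not hit, though given the nature of the bug it does not prove it is gone.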

[Regression Potential]

This patch is not anticipated to introduce any regressions.
-------------------------------------------------

At the moment, the `_get_pci_passthrough_devices` function is prone to
race conditions.

This specific code calls `listCaps()`; however, it is possible that
the device has disappeared by the time the method is called:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7949-L7959

This would result in the following traceback:

2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager [req-51b7c1c4-2b4a-46cc-9baa-8bf61801c48d - - - - -] Error updating resources for node <snip>.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager Traceback (most recent call last):
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 9946, in _update_available_resource_for_node
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     self.rt.update_available_resource(context, nodename,
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py", line 879, in update_available_resource
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 8937, in get_available_resource
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7663, in _get_pci_passthrough_devices
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     vdpa_devs = [
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7664, in <listcomp>
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     dev for dev in devices.values() if "vdpa" in dev.listCaps()
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 6276, in listCaps
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     raise libvirtError('virNodeDeviceListCaps() failed')
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager

I think the cleaner way is to loop over all the devices and skip any
device whose `listCaps()` call raises a "device not found" error.
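
A minimal sketch of that idea follows. It is not the merged Nova patch. To stay runnable without libvirt it uses stand-ins: `FakeLibvirtError` and `FakeNodeDevice` are hypothetical classes mimicking `libvirt.libvirtError` and a node device object, and `NO_NODE_DEVICE` is a placeholder value (real code would presumably compare `get_error_code()` against `libvirt.VIR_ERR_NO_NODE_DEVICE`). The helper names `safe_list_caps` and `find_vdpa_devices` are also invented here.

```python
from typing import Dict, List

# Placeholder error codes for this sketch only, not the real libvirt values.
NO_NODE_DEVICE = 1
OTHER_ERROR = 2


class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError so the sketch runs without libvirt."""

    def __init__(self, message: str, code: int) -> None:
        super().__init__(message)
        self._code = code

    def get_error_code(self) -> int:
        return self._code


class FakeNodeDevice:
    """Stand-in for a libvirt node device; 'gone' simulates the race."""

    def __init__(self, name: str, caps: List[str], gone: bool = False) -> None:
        self.name = name
        self._caps = caps
        self._gone = gone

    def listCaps(self) -> List[str]:
        if self._gone:
            raise FakeLibvirtError("Node device not found", NO_NODE_DEVICE)
        return list(self._caps)


def safe_list_caps(dev) -> List[str]:
    """Return the device capabilities, or [] if the device has vanished."""
    try:
        return dev.listCaps()
    except FakeLibvirtError as exc:
        if exc.get_error_code() == NO_NODE_DEVICE:
            return []  # device disappeared between listing and lookup
        raise  # any other libvirt error is still reported


def find_vdpa_devices(devices: Dict[str, FakeNodeDevice]) -> List[FakeNodeDevice]:
    """Same filter as the failing list comprehension, minus the race."""
    return [dev for dev in devices.values() if "vdpa" in safe_list_caps(dev)]
```

Skipping only the "not found" case keeps other libvirt failures visible; swallowing every libvirt error, as the SRU description above puts it, is the simpler variant of the same approach.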

** Affects: cloud-archive
     Importance: Undecided
         Status: New

** Affects: cloud-archive/yoga
     Importance: Undecided
         Status: New

** Affects: cloud-archive/zed
     Importance: Undecided
         Status: Fix Released

** Affects: nova
     Importance: Medium
     Assignee: Mohammed Naser (mnaser)
         Status: Fix Released

** Affects: nova (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: nova (Ubuntu Jammy)
     Importance: Undecided
         Status: New


** Tags: compute libvirt patch pci resource-tracker
-- 
[SRU] _get_pci_passthrough_devices prone to race condition
https://bugs.launchpad.net/bugs/1972028
You received this bug notification because you are a member of Ubuntu Sponsors, which is subscribed to the bug report.


