[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid

Rodrigo Barbieri 1944759 at bugs.launchpad.net
Fri Jun 21 15:40:15 UTC 2024


** Description changed:

  * SRU DESCRIPTION BELOW *
  
  Nova has a race condition between the resize_instance() compute manager
  call and the update_available_resources periodic job. If they overlap at
  the right point, when resize_instance calls finish_resize, the periodic
  job tracks neither the migration nor the instance on the source host. As
  a result, the PCPU allocation on the source host is dropped in the
  resource tracker (but not in placement). When the resize is later
  confirmed, nova tries to free the pinned CPUs on the source host again
  and fails with CPUUnpinningInvalid because they have already been freed.
  
  I've pushed a reproduction test:
  https://review.opendev.org/c/openstack/nova/+/810763
  
  It is reproducible at least on master, xena, wallaby, and victoria.
  
  ===============
  SRU DESCRIPTION
  ===============
  
  [Impact]
  
  Due to a race condition, the tracking of pinned CPU resources can go out
  of sync. New instances with CPU pinning then fail to be created with "No
  valid host" errors, because the previously pinned CPUs were not marked
  as freed.
  
  Part of the cause is addressed by the fix for LP#1953359, where a
  migration context does not point to the proper node during the race
  condition window, resulting in a CPUPinningInvalid error. This fix
  complements LP#1953359 by addressing the improper tracking of resources
  that happens only when the resource tracker periodic job runs on the
  source node while the registered flavor is that of the destination.
  This is solved by setting instance.old_flavor so that the CPU pinning
  resources are tracked properly.
  
  [Test case]
  
  The test case for this is already implemented in the non-live
  functional tests upstream:
  
  in nova/tests/functional/libvirt/test_numa_servers.py:
  - test_resize_dedicated_policy_race_on_dest_bug_1953359
  - test_resize_confirm_bug_1944759
  - test_resize_revert_bug_1944759
  
  As this is a race condition, it is very difficult to validate, even
  upstream, so the functional tests mock certain parts of the code to
  simulate the entire workflow. They are non-live functional tests, so
  they are more akin to broader unit tests.
  
+ The test case for this SRU is to run the charmed-openstack-tester [1]
+ against an environment containing the upgraded package (essentially as
+ would be done for a point release SRU) and expect the test to pass.
+ Test run evidence will be attached to the LP bug.
+ 
  [Regression Potential]
  
  The code is considered stable in newer releases and the scope of the
  affected code is fairly limited. Given that this is a race condition
  that is difficult to validate, even with the non-live functional tests,
  the regression potential is moderate.
  
  [Other Info]
  
  None.
+ 
+ [1] https://github.com/openstack-charmers/charmed-openstack-tester

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1944759

Title:
  [SRU] confirm resize fails with CPUUnpinningInvalid

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive ussuri series:
  Triaged
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  New
Status in nova package in Ubuntu:
  Invalid
Status in nova source package in Focal:
  Triaged

Bug description:
  * SRU DESCRIPTION BELOW *

  Nova has a race condition between the resize_instance() compute manager
  call and the update_available_resources periodic job. If they overlap
  at the right point, when resize_instance calls finish_resize, the
  periodic job tracks neither the migration nor the instance on the
  source host. As a result, the PCPU allocation on the source host is
  dropped in the resource tracker (but not in placement). When the
  resize is later confirmed, nova tries to free the pinned CPUs on the
  source host again and fails with CPUUnpinningInvalid because they have
  already been freed.
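
  The failure sequence described above can be sketched with a toy model.
  This is only an illustration under stated assumptions, not nova code:
  ToySourceHostTracker, periodic_update() and simulate() are hypothetical
  names, and only the exception name CPUUnpinningInvalid is taken from
  this report.

```python
# Toy model only: none of these classes are nova's real implementation.


class CPUUnpinningInvalid(Exception):
    """Raised when asked to unpin CPUs that are not currently pinned."""


class ToySourceHostTracker:
    """Hypothetical stand-in for the source host's resource tracker."""

    def __init__(self):
        self.pinned = set()   # host CPUs currently marked as pinned
        self.tracked = {}     # instance uuid -> CPUs pinned for it

    def pin(self, uuid, cpus):
        self.pinned |= cpus
        self.tracked[uuid] = set(cpus)

    def unpin(self, cpus):
        # Unpinning is only valid for CPUs that are still marked pinned.
        if not cpus <= self.pinned:
            raise CPUUnpinningInvalid(
                f"cannot unpin {sorted(cpus)}; pinned: {sorted(self.pinned)}")
        self.pinned -= cpus

    def periodic_update(self, visible_instances):
        """Rebuild pinned usage, counting only the instances it can see."""
        self.pinned = set()
        for uuid in visible_instances:
            self.pinned |= self.tracked.get(uuid, set())


def simulate(periodic_runs_in_race_window):
    tracker = ToySourceHostTracker()
    tracker.pin("vm-1", {2, 3})
    if periodic_runs_in_race_window:
        # Assumed race: neither the instance nor its migration is visible
        # to the periodic job, so the pinning is silently dropped.
        tracker.periodic_update(visible_instances=[])
    try:
        tracker.unpin({2, 3})  # confirm resize frees the source pinning
        return "confirmed"
    except CPUUnpinningInvalid:
        return "CPUUnpinningInvalid"
```

  In this sketch, simulate(False) confirms the resize normally, while
  simulate(True) fails because the periodic rebuild already dropped the
  pinned CPUs before the confirm step tried to free them.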

  I've pushed a reproduction test:
  https://review.opendev.org/c/openstack/nova/+/810763

  It is reproducible at least on master, xena, wallaby, and victoria.

  ===============
  SRU DESCRIPTION
  ===============

  [Impact]

  Due to a race condition, the tracking of pinned CPU resources can go
  out of sync. New instances with CPU pinning then fail to be created
  with "No valid host" errors, because the previously pinned CPUs were
  not marked as freed.

  Part of the cause is addressed by the fix for LP#1953359, where a
  migration context does not point to the proper node during the race
  condition window, resulting in a CPUPinningInvalid error. This fix
  complements LP#1953359 by addressing the improper tracking of
  resources that happens only when the resource tracker periodic job
  runs on the source node while the registered flavor is that of the
  destination. This is solved by setting instance.old_flavor so that the
  CPU pinning resources are tracked properly.
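
  The role of instance.old_flavor can be illustrated with a minimal
  sketch. It is not the actual patch: ToyFlavor, ToyInstance and
  flavor_for_source_host() are hypothetical names, and the only detail
  taken from this report is that the source host should account for the
  old flavor once instance.old_flavor is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFlavor:
    """Hypothetical flavor carrying only a pinned CPU count."""
    name: str
    pinned_cpus: int


@dataclass
class ToyInstance:
    """Hypothetical instance record mid-resize."""
    flavor: ToyFlavor                       # already the destination flavor
    old_flavor: Optional[ToyFlavor] = None  # source flavor, if recorded


def flavor_for_source_host(instance):
    """Pick the flavor whose usage the source host should account for."""
    if instance.old_flavor is not None:
        return instance.old_flavor
    # Without old_flavor the source host can only see the new flavor,
    # which is the mismatch described in the paragraph above.
    return instance.flavor


old = ToyFlavor("pinned-2", pinned_cpus=2)
new = ToyFlavor("pinned-4", pinned_cpus=4)

untracked = ToyInstance(flavor=new)
tracked = ToyInstance(flavor=new, old_flavor=old)
```

  Here flavor_for_source_host(untracked) yields the destination flavor,
  so the source host would account for the wrong pinning, while
  flavor_for_source_host(tracked) yields the original flavor.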

  [Test case]

  The test case for this is already implemented in the non-live
  functional tests upstream:

  in nova/tests/functional/libvirt/test_numa_servers.py:
  - test_resize_dedicated_policy_race_on_dest_bug_1953359
  - test_resize_confirm_bug_1944759
  - test_resize_revert_bug_1944759

  As this is a race condition, it is very difficult to validate, even
  upstream, so the functional tests mock certain parts of the code to
  simulate the entire workflow. They are non-live functional tests, so
  they are more akin to broader unit tests.

  The test case for this SRU is to run the charmed-openstack-tester [1]
  against an environment containing the upgraded package (essentially as
  would be done for a point release SRU) and expect the test to pass.
  Test run evidence will be attached to the LP bug.

  [Regression Potential]

  The code is considered stable in newer releases and the scope of the
  affected code is fairly limited. Given that this is a race condition
  that is difficult to validate, even with the non-live functional
  tests, the regression potential is moderate.

  [Other Info]

  None.

  [1] https://github.com/openstack-charmers/charmed-openstack-tester

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1944759/+subscriptions



