[Bug 1821755] Re: [SRU] live migration break the anti-affinity policy of server group simultaneously

Rodrigo Barbieri 1821755 at bugs.launchpad.net
Wed Nov 17 21:13:17 UTC 2021


I just re-tested in focal-ussuri: there is no race condition and the fix
works. I'm going to narrow down further why it does not work in Train,
but I believe it is very unlikely that we will move forward with the
Train SRU.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1821755

Title:
  [SRU] live migration break the anti-affinity policy of server group
  simultaneously

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Committed
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  Fix Committed
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in OpenStack Compute (nova) victoria series:
  Fix Released
Status in OpenStack Compute (nova) wallaby series:
  Fix Released

Bug description:
  --------------------------------
  NOTE: SRU template at the bottom
  --------------------------------

  Description
  ===========
  If we live migrate two instances simultaneously, the instances will break the instance group policy.

  Steps to reproduce
  ==================
  Use an OpenStack env with three compute nodes (node1, node2 and node3). Create two VMs (vm1, vm2) with the anti-affinity policy.
  Finally, live migrate the two VMs simultaneously.

  Before live-migration, the VMs are located as follows:
  node1 -> vm1
  node2 -> vm2
  node3

  * nova live-migration vm1
  * nova live-migration vm2

  Expected result
  ===============
  Fail to live migrate vm1 and vm2.

  Actual result
  =============
  node1
  node2
  node3 -> vm1,vm2

  Environment
  ===========
  master branch of openstack

  As described above, live migration does not take in-progress live
  migrations into account and just selects the host via the scheduler
  filters, so both instances are migrated to the same host.
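
  The race can be illustrated with a toy model. This is only a sketch,
  not nova code: pick_host, land, simulate and PolicyViolation are
  hypothetical names, and the "validate" flag stands in for a policy
  re-check performed when the instance arrives at the destination.

```python
class PolicyViolation(Exception):
    """Destination already holds a member of the anti-affinity group."""


def pick_host(placement):
    """Return the first host with no group member, as a scheduler filter might.

    Only completed placements are visible; in-progress migrations are not.
    """
    for host in sorted(placement):
        if not placement[host]:
            return host
    raise LookupError("no valid host")


def land(placement, vm, dest, validate):
    """Move vm to dest, optionally re-checking the policy on arrival."""
    if validate and placement[dest]:
        raise PolicyViolation(f"{vm} cannot join {sorted(placement[dest])} on {dest}")
    for vms in placement.values():
        vms.discard(vm)
    placement[dest].add(vm)


def simulate(validate):
    """Schedule both migrations before either lands, then land them."""
    placement = {"node1": {"vm1"}, "node2": {"vm2"}, "node3": set()}
    # Both decisions are made from the same snapshot, so both pick node3.
    destinations = {vm: pick_host(placement) for vm in ("vm1", "vm2")}
    failed = []
    for vm, dest in destinations.items():
        try:
            land(placement, vm, dest, validate)
        except PolicyViolation:
            failed.append(vm)  # the VM stays on its source host
    return placement, failed
```

  With validate=False both VMs end up on node3, matching the actual
  result above. With validate=True the second arrival is rejected and
  that VM stays where it was.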

  ----------------------------------------------------

  ===============
  SRU Description
  ===============

  [Impact]

  When performing multiple live migrations, cold migrations or resizes
  simultaneously, the affinity or anti-affinity policy is violated,
  allowing the migrated VM to land on a host that conflicts with the
  policy.

  [Test case]

  1. Setting up the env

  1a. Deploy env with 5 compute nodes

  1b. Confirm that all nodes have the same CPU architecture (so live-
  migration works between them) either by running lscpu or "openstack
  hypervisor show <node>" on each of the nodes

  1c. Create anti-affinity policy

  openstack server group create anti-aff --policy anti-affinity

  1d. Create flavor

  openstack flavor create --vcpu 1 --ram 1024 --disk 0 --id 100 test-flavor

  1e. Create volumes

  openstack volume create --image cirros --size 1 vol1
  openstack volume create --source vol1 --size 1 vol2 && openstack volume create --source vol1 --size 1 vol3

  2. Prepare to reproduce the bug

  2a. Get group ID

  GROUP_ID=$(openstack server group show anti-aff -c id -f value)

  2b. Create VMs

  openstack server create --network private --volume vol1 --flavor 100 --hint group=$GROUP_ID ins1 &&
  openstack server create --network private --volume vol2 --flavor 100 --hint group=$GROUP_ID ins2 &&
  openstack server create --network private --volume vol3 --flavor 100 --hint group=$GROUP_ID ins3

  2c. Confirm each one is on a different host by running "openstack
  server list --long" and take note of the hosts
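
  The host check can also be scripted. The sketch below assumes the
  JSON printed by "openstack server list --long -f json" has "Name" and
  "Host" keys per server; shared_hosts is a hypothetical helper, not
  part of any OpenStack tool.

```python
import json
from collections import defaultdict


def shared_hosts(server_list_json, names=("ins1", "ins2", "ins3")):
    """Map each host holding more than one of the named VMs to those VMs.

    Assumes each JSON entry carries "Name" and "Host" keys.
    """
    by_host = defaultdict(list)
    for server in json.loads(server_list_json):
        if server["Name"] in names:
            by_host[server["Host"]].append(server["Name"])
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}
```

  An empty result means every VM is on its own host. The same check
  applies later when looking for VMs that ended up on the same host
  after a migration.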

  3. Reproducing the bug (Live migration)

  3a. Perform the set of steps in (2) if not already done.

  3b. openstack server migrate ins1 --live-migration & openstack server migrate ins2 --live-migration & openstack server migrate ins3 --live-migration

  3c. watch "openstack server list --long" until all migrations are
  finished

  3d. Confirm that at least 1 VM is on the same host as another VM.
  Otherwise, repeat steps 3a - 3c.
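
  Since the reproduction may need several attempts, the concurrent
  launch from step 3b can be wrapped in a small script. This is a sketch
  only: migrate_commands and run_concurrently are hypothetical helpers,
  and the default runner simply shells out to the same CLI commands used
  above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def migrate_commands(names, live=True):
    """Build one 'openstack server migrate' command line per VM name."""
    extra = ["--live-migration"] if live else []
    return [["openstack", "server", "migrate", name, *extra] for name in names]


def run_concurrently(commands, runner=subprocess.run):
    """Start all commands at once, one thread each; results keep input order.

    The runner is injectable so the function can be exercised without a cloud.
    """
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(runner, commands))
```

  Passing live=False yields the plain "openstack server migrate"
  commands used for the cold migration case.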

  4. Reproducing the bug (Cold Migration)

  4a. Perform the set of steps in (2) if not already done

  4b. openstack server migrate ins1 & openstack server migrate ins2 &
  openstack server migrate ins3

  4c. watch "openstack server list --long" until all statuses are
  "VERIFY_RESIZE"

  4d. Confirm that at least 1 VM is on the same host as another VM.
  Otherwise, repeat steps 4a - 4c.

  4e. Confirm all the resizes by running "openstack server resize
  confirm <vm>"

  5a. Install package that contains the fixed code on all compute nodes

  5b. Cleanup all the VMs

  6. Confirm fix (Live migration)

  6a. Perform steps 3a - 3c

  6b. Confirm there are no VMs on the same host nor VMs with ERROR
  status.

  6c. Confirm there are VMs that have ACTIVE status and did not move
  hosts. Otherwise, repeat step 6a.

  6d. Run "openstack server event list <vm-id>", then "openstack server
  event show <vm-id> <req-id>" for the live-migration event of the VMs
  assessed in step 6c. Confirm the "message" field is "error" and the
  traceback is part of the "compute_check_can_live_migrate_destination"
  or "compute_pre_live_migration" events with result=Error and the
  traceback ends in the _do_validation function. Repeat this step to
  capture both events.

  6e. Check the logs for messages related to the VMs assessed in step (6c), where:
  - For compute_check_can_live_migrate_destination: egrep -rnIi "MigrationPreCheckError: Migration pre-check error: Failed to validate instance group policy due to.*e9ec173a-4491-4541-9bd4-951692e48c8f.*Anti-affinity instance group policy was violated" /var/log/nova
  - For compute_pre_live_migration: grep -rnIi "RescheduledException_Remote: Build of instance c55889d9-6cbe-409a-b118-7b4a8d808972 was re-scheduled: Anti-affinity instance group policy was violated." /var/log/nova

  
  7. Confirm fix (Cold migration)

  7a. Perform steps 4a - 4c, while taking note of the timestamp (by
  running "date") before running the migration command

  7b. Confirm there are no VMs on the same host nor VMs with ERROR
  status. There should be VMs with "VERIFY_RESIZE" and "ACTIVE"
  statuses. If there are no ACTIVE instances, confirm the resizes and
  repeat step 7a.

  7c. For the ones that are ACTIVE, check logs for error messages. There
  should be an error message about "anti-affinity":

  egrep -rnIi "3e926491-d0dc-4611-8e87-75604c67f308.*Anti-affinity instance group policy was violated" /var/log/nova

  /var/log/nova/nova-compute.log:40797:2021-07-22 19:19:54.075 1692 ERROR oslo_messaging.rpc.server nova.exception.RescheduledException: Build of instance 3e926491-d0dc-4611-8e87-75604c67f308 was re-scheduled: Anti-affinity instance group policy was violated.

  7d. Confirm that the log timestamp is within a few seconds after the
  time the migration command was issued.

  7e. Run "openstack server event list <vm-id>", then "openstack server
  event show <vm-id> <req-id>" for the migration event. Confirm the
  "message" field is "error" and the "events" field includes a "No Valid
  Host" final message, with the "compute_prep_resize" event with
  result=Error and ending the traceback in the _do_validation function.
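
  The log and timestamp checks in steps 7c and 7d can be combined in a
  short script. This is a sketch under assumptions: violations_after is
  a hypothetical helper, the 60 second default window is an arbitrary
  stand-in for "a few seconds", and the log timestamps are assumed to be
  in the same timezone as the time noted in step 7a.

```python
import re
from datetime import datetime, timedelta

VIOLATION = "Anti-affinity instance group policy was violated"
# Matches timestamps such as "2021-07-22 19:19:54.075".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")


def violations_after(lines, instance_uuid, issued_at, window=timedelta(seconds=60)):
    """Return matching log lines stamped within `window` after `issued_at`.

    A line matches when the instance UUID is followed by the violation
    message (case-insensitive), mirroring the egrep pattern in step 7c.
    """
    pattern = re.compile(
        re.escape(instance_uuid) + ".*" + re.escape(VIOLATION), re.IGNORECASE
    )
    hits = []
    for line in lines:
        stamp = TIMESTAMP.search(line)
        if stamp is None or not pattern.search(line):
            continue
        logged_at = datetime.strptime(stamp.group(0), "%Y-%m-%d %H:%M:%S.%f")
        if timedelta(0) <= logged_at - issued_at <= window:
            hits.append(line)
    return hits
```

  Feeding it the lines of /var/log/nova/nova-compute.log together with
  the VM ID and the noted time should return the RescheduledException
  line shown in step 7c when the fix is in effect.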

  [Regression Potential]

  Part of the new code path has been tested in upstream CI in happy
  migration paths. Concurrency has not been tested in the CI to trigger
  the error in a negative test. The exception handling code is executed
  only in case the exception is raised (in case of policy violation), so
  this code path is being tested manually as part of the upstream patch
  work and SRU.

  [Other Info]

  None

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1821755/+subscriptions



