[Bug 1471022] Re: [SRU] race between nova-compute and neutron-ovs-cleanup

Edward Hope-Morley edward.hope-morley at canonical.com
Fri Jul 3 18:42:21 UTC 2015


** Description changed:

+ [Impact]
+ 
  This issue appears to be a consequence of
  https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we
  added a 'wait-for-state running' to the nova-compute upstart so as to
  ensure that neutron-ovs-cleanup has finished before nova-compute starts.
  
  I have started to spot, however, that on some hosts (metal only) there
  is now a race between the two whereby nova-compute sometimes fails to
  start on system boot/reboot with the following in
  /var/log/upstart/nova-compute.log:
  
  ...
  libvirt-bin stop/waiting
  wait-for-state stop/waiting
  neutron-ovs-cleanup start/pre-start, process 3084
  start: Job failed to start
  
  If I manually restart nova-compute all is fine. So this looks like a
  race between nova-compute's wait-for-state and neutron-ovs-cleanup's
  pre-start -> start/running.
+ 
+ The proposed solution here is to add some retry logic to the nova-compute
+ upstart job to tolerate neutron-ovs-cleanup not being able to start yet.
+ We, therefore, allow a certain number of retries, every other with an
+ incremented delay, before giving up and allowing nova-compute to start
+ anyway. If ovs-cleanup failed to start after what is a fairly liberal
+ retry period, it is assumed to have failed altogether, this making it
+ safe(ish) to start nova-compute.
+ 
+ [Test Case]
+ 
+ In one terminal (as root) do:
+ service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart
+ 
+ In another do:
+ sudo tail -F /var/log/upstart/nova-compute.log
+ 
+ Observe the retries occurring
+ 
+ Then do 'sudo service openvswitch-switch start' and observe nova-compute
+ retry and succeed.
+ 
+ [Regression Potential]
+ 
+  * If openvswitch-switch does not start within the max retries and
+ intervals, nova-compute will start anyway, and if ovs-cleanup were at
+ some point to run, one would see the behaviour that LP 1420572 was
+ intended to resolve. It does not seem to make sense to wait indefinitely
+ for ovs-cleanup to be up, and the coded interval is pretty liberal and
+ should be plenty.

** Changed in: nova (Ubuntu Trusty)
       Status: New => In Progress

** Changed in: nova (Ubuntu Utopic)
       Status: New => In Progress

** Changed in: nova (Ubuntu Vivid)
       Status: New => In Progress

** Changed in: nova (Ubuntu Trusty)
     Assignee: (unassigned) => Edward Hope-Morley (hopem)

** Changed in: nova (Ubuntu Utopic)
     Assignee: (unassigned) => Edward Hope-Morley (hopem)

** Changed in: nova (Ubuntu Vivid)
     Assignee: (unassigned) => Edward Hope-Morley (hopem)

** Description changed:

  [Impact]
  
  This issue appears to be a consequence of
  https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we
  added a 'wait-for-state running' to the nova-compute upstart so as to
  ensure that neutron-ovs-cleanup has finished before nova-compute starts.
  
  I have started to spot, however, that on some hosts (metal only) there
  is now a race between the two whereby nova-compute sometimes fails to
  start on system boot/reboot with the following in
  /var/log/upstart/nova-compute.log:
  
  ...
  libvirt-bin stop/waiting
  wait-for-state stop/waiting
  neutron-ovs-cleanup start/pre-start, process 3084
  start: Job failed to start
  
  If I manually restart nova-compute all is fine. So this looks like a
  race between nova-compute's wait-for-state and neutron-ovs-cleanup's
  pre-start -> start/running.
  
  The proposed solution here is to add some retry logic to the nova-compute
  upstart job to tolerate neutron-ovs-cleanup not being able to start yet.
  We, therefore, allow a certain number of retries, every other with an
  incremented delay, before giving up and allowing nova-compute to start
  anyway. If ovs-cleanup failed to start after what is a fairly liberal
- retry period, it is assumed to have failed altogether, this making it
+ retry period, it is assumed to have failed altogether, thus making it
  safe(ish) to start nova-compute.
  
  [Test Case]
  
  In one terminal (as root) do:
  service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart
  
  In another do:
  sudo tail -F /var/log/upstart/nova-compute.log
  
  Observe the retries occurring
  
  Then do 'sudo service openvswitch-switch start' and observe nova-compute
  retry and succeed.
  
  [Regression Potential]
  
-  * If openvswitch-switch does not start within the max retries and
+  * If openvswitch-switch does not start within the max retries and
  intervals, nova-compute will start anyway, and if ovs-cleanup were at
  some point to run, one would see the behaviour that LP 1420572 was
  intended to resolve. It does not seem to make sense to wait indefinitely
  for ovs-cleanup to be up, and the coded interval is pretty liberal and
  should be plenty.

** Description changed:

  [Impact]
  
  This issue appears to be a consequence of
  https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we
  added a 'wait-for-state running' to the nova-compute upstart so as to
  ensure that neutron-ovs-cleanup has finished before nova-compute starts.
  
  I have started to spot, however, that on some hosts (metal only) there
  is now a race between the two whereby nova-compute sometimes fails to
  start on system boot/reboot with the following in
  /var/log/upstart/nova-compute.log:
  
  ...
  libvirt-bin stop/waiting
  wait-for-state stop/waiting
  neutron-ovs-cleanup start/pre-start, process 3084
  start: Job failed to start
  
  If I manually restart nova-compute all is fine. So this looks like a
  race between nova-compute's wait-for-state and neutron-ovs-cleanup's
  pre-start -> start/running.
  
  The proposed solution here is to add some retry logic to the nova-compute
  upstart job to tolerate neutron-ovs-cleanup not being able to start yet.
  We, therefore, allow a certain number of retries, every other with an
  incremented delay, before giving up and allowing nova-compute to start
  anyway. If ovs-cleanup failed to start after what is a fairly liberal
  retry period, it is assumed to have failed altogether, thus making it
  safe(ish) to start nova-compute.
  
  [Test Case]
  
  In one terminal (as root) do:
  service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart
  
  In another do:
  sudo tail -F /var/log/upstart/nova-compute.log
  
  Observe the retries occurring
  
  Then do 'sudo service openvswitch-switch start' and observe nova-compute
  retry and succeed.
  
  [Regression Potential]
  
-  * If openvswitch-switch does not start within the max retries and
+ If openvswitch-switch does not start within the max retries and
  intervals, nova-compute will start anyway, and if ovs-cleanup were at
  some point to run, one would see the behaviour that LP 1420572 was
  intended to resolve. It does not seem to make sense to wait indefinitely
  for ovs-cleanup to be up, and the coded interval is pretty liberal and
  should be plenty.
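
The retry logic described under [Impact] could look roughly like the
sketch below. This is not the actual patch attached to this bug; the
function names (retry_until, ovs_cleanup_running), the SLEEP_CMD
override, the retry count and the one-second starting delay are all
hypothetical. The description says the delay is incremented on "every
other" retry; the sketch reads that as growing the delay after every
second attempt, which is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of retry-with-growing-delay; NOT the real patch.

# retry_until CHECK_CMD MAX_RETRIES
# Runs CHECK_CMD until it succeeds or MAX_RETRIES attempts have failed.
# Returns 0 as soon as CHECK_CMD succeeds, 1 if the retries run out.
# SLEEP_CMD may be overridden (e.g. SLEEP_CMD=:) to skip real sleeping.
retry_until() {
    check_cmd=$1
    max_retries=$2
    attempt=1
    delay=1
    while [ "$attempt" -le "$max_retries" ]; do
        if $check_cmd; then
            return 0
        fi
        echo "attempt $attempt of $max_retries failed, sleeping ${delay}s"
        ${SLEEP_CMD:-sleep} "$delay"
        # Assumption: grow the delay by one second after every second attempt.
        if [ $((attempt % 2)) -eq 0 ]; then
            delay=$((delay + 1))
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical check: succeeds once upstart reports the job as start/running.
ovs_cleanup_running() {
    status neutron-ovs-cleanup 2>/dev/null | grep -q "start/running"
}

# Possible use in a pre-start script (give up and start anyway):
# retry_until ovs_cleanup_running 10 || echo "giving up, starting nova-compute"
```

Returning non-zero rather than exiting leaves the caller free to log
the failure and carry on, which matches the "start nova-compute anyway"
behaviour described above.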

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1471022

Title:
  [SRU] race between nova-compute and neutron-ovs-cleanup

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1471022/+subscriptions