[Bug 1353939] Re: Rescue fails with 'Failed to terminate process: Device or resource busy' in the n-cpu log
OpenStack Infra
1353939 at bugs.launchpad.net
Tue Jul 2 21:02:16 UTC 2019
Reviewed: https://review.opendev.org/667436
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=75985e25bc147369efb90d4fa9f046631766c14c
Submitter: Zuul
Branch: stable/rocky
commit 75985e25bc147369efb90d4fa9f046631766c14c
Author: Kashyap Chamarthy <kchamart at redhat.com>
Date: Mon Feb 25 13:26:24 2019 +0100
libvirt: Rework 'EBUSY' (SIGKILL) error handling code path
Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
_destroy()) handled the case where a QEMU process "refuses to die" within
a given timeout period set by libvirt.
Originally, libvirt sent SIGTERM (allowing the process to clean up
resources) and waited 10 seconds. If the guest still hadn't gone away,
it then sent the more lethal SIGKILL and waited another 5 seconds for
it to take effect.
From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
for a guest hard shutdown to complete. It now waits for 30 seconds for
SIGKILL to work (instead of 5). Also, additional wait time is added if
there are assigned PCI devices, as some of those tend to slow things
down.
In this change:
- Increment the counter for retrying the _destroy() call from 3 to 6,
thus increasing the total time from 15 to 30 seconds before SIGKILL
takes effect. This matches the (more graceful) behaviour of
libvirt v4.7.0, and gives breathing room to Nova instances
running in environments with large compute nodes and high instance
creation or deletion churn, where the current timeout may not be
sufficient.
- Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
than 4.7.0.
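The two points above can be sketched as a small retry loop. This is
illustrative only: the names, the GuestBusy exception, and the
destroy_with_retries() helper are hypothetical stand-ins, not Nova's
actual code, which lives in nova/virt/libvirt/driver.py.

```python
# Illustrative constants; Nova's actual names and values differ.
MIN_LIBVIRT_VERSION_WITH_LONG_WAIT = (4, 7, 0)
MAX_DESTROY_RETRIES = 6  # raised from 3, i.e. ~30s instead of ~15s


class GuestBusy(Exception):
    """Stand-in for the libvirt error raised while SIGKILL is pending."""


def destroy_with_retries(destroy, libvirt_version):
    """Call destroy(), retrying on 'busy' errors only for old libvirt.

    From libvirt v4.7.0 onwards the daemon itself waits 30 seconds
    after SIGKILL, so retrying in Nova would be redundant and the
    error is re-raised immediately.
    """
    retries = 0
    while True:
        try:
            return destroy()
        except GuestBusy:
            if libvirt_version >= MIN_LIBVIRT_VERSION_WITH_LONG_WAIT:
                raise  # libvirt already waited long enough
            retries += 1
            if retries >= MAX_DESTROY_RETRIES:
                raise
```

With an old libvirt the call is retried up to six times; with v4.7.0
or newer the first failure propagates, since libvirt's own 30-second
wait has already elapsed.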
[1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
(process: wait longer 5->30s on hard shutdown)
[2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
wait longer on kill per assigned Hostdev")
Conflicts:
nova/virt/libvirt/driver.py
(Trivial conflict: Rocky didn't have the QEMU-native TLS feature
yet.)
Related-bug: #1353939
Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
Signed-off-by: Kashyap Chamarthy <kchamart at redhat.com>
(cherry picked from commit 10d50ca4e210039aeae84cb9bd5d18895948af54)
** Tags added: in-stable-rocky
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1353939
Title:
Rescue fails with 'Failed to terminate process: Device or resource
busy' in the n-cpu log
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive juno series:
Fix Released
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) juno series:
Won't Fix
Status in OpenStack Compute (nova) kilo series:
Fix Released
Status in nova package in Ubuntu:
Invalid
Bug description:
[Impact]
* Users may sometimes fail to shut down an instance if the associated qemu
process is in uninterruptible sleep (typically on IO).
[Test Case]
* 1. create some IO load in a VM
2. look at the associated qemu process; make sure it has STAT D in ps output
3. shut down the instance
4. with the patch in place, nova will retry calling libvirt to shut down
the instance 3 times, waiting for the signal to be delivered to the
qemu process.
[Regression Potential]
* None
message: "Failed to terminate process" AND
message:'InstanceNotRescuable' AND message: 'Exception during message
handling' AND tags:"screen-n-cpu.txt"
The above Logstash query reports back only the failed jobs; the 'Failed
to terminate process' error also appears close to other failed rescue
tests, but tempest does not always report them as an error at the end.
message: "Failed to terminate process" AND tags:"screen-n-cpu.txt"
Usual console log:
Details: (ServerRescueTestJSON:test_rescue_unrescue_instance) Server 0573094d-53da-40a5-948a-747d181462f5 failed to reach RESCUE status and task state "None" within the required time (196 s). Current status: SHUTOFF. Current task state: None.
http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-
full/90726cb/console.html#_2014-08-07_03_50_26_520
Usual n-cpu exception:
http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-full/90726cb/logs/screen-n-cpu.txt.gz#_2014-08-07_03_32_02_855
2014-08-07 03:32:02.855 ERROR oslo.messaging.rpc.dispatcher [req-39ce7a3d-5ceb-41f5-8f9f-face7e608bd1 ServerRescueTestJSON-2035684545 ServerRescueTestJSON-1017508309] Exception during message handling: Instance 0573094d-53da-40a5-948a-747d181462f5 cannot be rescued: Driver Error: Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 408, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher payload)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 292, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher pass
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 278, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 342, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 320, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 308, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 3094, in rescue_instance
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher reason=_("Driver Error: %s") % unicode(e))
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher InstanceNotRescuable: Instance 0573094d-53da-40a5-948a-747d181462f5 cannot be rescued: Driver Error: Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1353939/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list