[Bug 1353939] Re: Rescue fails with 'Failed to terminate process: Device or resource busy' in the n-cpu log
OpenStack Infra
1353939 at bugs.launchpad.net
Tue Jul 2 21:02:16 UTC 2019
Reviewed: https://review.opendev.org/667436
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=75985e25bc147369efb90d4fa9f046631766c14c
Submitter: Zuul
Branch: stable/rocky
commit 75985e25bc147369efb90d4fa9f046631766c14c
Author: Kashyap Chamarthy <kchamart at redhat.com>
Date: Mon Feb 25 13:26:24 2019 +0100
libvirt: Rework 'EBUSY' (SIGKILL) error handling code path
Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
_destroy()) handled the case where a QEMU process "refuses to die" within
a given timeout period set by libvirt.
Originally, libvirt sent SIGTERM (allowing the process to clean up
resources) and waited 10 seconds. If the guest still hadn't gone away,
it then sent the more lethal SIGKILL and waited another 5 seconds for
it to take effect.
From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
for a guest hard shutdown to complete. It now waits for 30 seconds for
SIGKILL to work (instead of 5). Also, additional wait time is added if
there are assigned PCI devices, as some of those tend to slow things
down.
In this change:
- Increment the counter for retrying the _destroy() call from 3 to 6,
thus increasing the total time from 15 to 30 seconds before SIGKILL
takes effect. This matches the (more graceful) behaviour of
libvirt v4.7.0, and gives breathing room to Nova instances
running in environments with large compute nodes and high instance
creation or deletion churn, where the current timeout may not be
sufficient.
- Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
than 4.7.0.
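The two points above can be sketched as a small retry loop. This is
illustrative only: the names, the GuestBusy exception, and the
destroy_with_retries() helper are hypothetical stand-ins, not Nova's
actual code, which lives in nova/virt/libvirt/driver.py.

```python
# Illustrative constants; Nova's actual names and values differ.
MIN_LIBVIRT_VERSION_WITH_LONG_WAIT = (4, 7, 0)
MAX_DESTROY_RETRIES = 6  # raised from 3, i.e. ~30s instead of ~15s


class GuestBusy(Exception):
    """Stand-in for the libvirt error raised while SIGKILL is pending."""


def destroy_with_retries(destroy, libvirt_version):
    """Call destroy(), retrying on 'busy' errors only for old libvirt.

    From libvirt v4.7.0 onwards the daemon itself waits 30 seconds
    after SIGKILL, so retrying in Nova would be redundant and the
    error is re-raised immediately.
    """
    retries = 0
    while True:
        try:
            return destroy()
        except GuestBusy:
            if libvirt_version >= MIN_LIBVIRT_VERSION_WITH_LONG_WAIT:
                raise  # libvirt already waited long enough
            retries += 1
            if retries >= MAX_DESTROY_RETRIES:
                raise
```

With an old libvirt the call is retried up to six times; with v4.7.0
or newer the first failure propagates, since libvirt's own 30-second
wait has already elapsed.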
[1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
(process: wait longer 5->30s on hard shutdown)
[2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
wait longer on kill per assigned Hostdev")
Conflicts:
nova/virt/libvirt/driver.py
(Trivial conflict: Rocky didn't have the QEMU-native TLS feature
yet.)
Related-bug: #1353939
Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
Signed-off-by: Kashyap Chamarthy <kchamart at redhat.com>
(cherry picked from commit 10d50ca4e210039aeae84cb9bd5d18895948af54)
** Tags added: in-stable-rocky
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1353939
Title:
Rescue fails with 'Failed to terminate process: Device or resource
busy' in the n-cpu log
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive juno series:
Fix Released
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) juno series:
Won't Fix
Status in OpenStack Compute (nova) kilo series:
Fix Released
Status in nova package in Ubuntu:
Invalid
Bug description:
[Impact]
* Users may sometimes fail to shut down an instance if the associated qemu
process is in uninterruptible sleep (typically on IO).
[Test Case]
* 1. create some IO load in a VM
2. look at the associated qemu process; make sure it has STAT D in ps output
3. shut down the instance
4. with the patch in place, nova will retry calling libvirt to shut down
the instance 3 times, waiting for the signal to be delivered to the
qemu process.
[Regression Potential]
* None
message: "Failed to terminate process" AND
message:'InstanceNotRescuable' AND message: 'Exception during message
handling' AND tags:"screen-n-cpu.txt"
The above Logstash query reports back only the failed jobs; the 'Failed
to terminate process' error also appears close to other failed rescue
tests, but tempest does not always report them as an error at the end.
message: "Failed to terminate process" AND tags:"screen-n-cpu.txt"
Usual console log:
Details: (ServerRescueTestJSON:test_rescue_unrescue_instance) Server 0573094d-53da-40a5-948a-747d181462f5 failed to reach RESCUE status and task state "None" within the required time (196 s). Current status: SHUTOFF. Current task state: None.
http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-
full/90726cb/console.html#_2014-08-07_03_50_26_520
Usual n-cpu exception:
http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-full/90726cb/logs/screen-n-cpu.txt.gz#_2014-08-07_03_32_02_855
2014-08-07 03:32:02.855 ERROR oslo.messaging.rpc.dispatcher [req-39ce7a3d-5ceb-41f5-8f9f-face7e608bd1 ServerRescueTestJSON-2035684545 ServerRescueTestJSON-1017508309] Exception during message handling: Instance 0573094d-53da-40a5-948a-747d181462f5 cannot be rescued: Driver Error: Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 408, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher payload)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 292, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher pass
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 278, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 342, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 320, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 308, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 3094, in rescue_instance
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher reason=_("Driver Error: %s") % unicode(e))
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher InstanceNotRescuable: Instance 0573094d-53da-40a5-948a-747d181462f5 cannot be rescued: Driver Error: Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1353939/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list