[Bug 1889765] Re: Notification processing failure causes notifications to get stuck
Pedro Guimarães
1889765 at bugs.launchpad.net
Thu Jun 17 13:57:16 UTC 2021
Hi @james-page, I've run the package in focal-proposed:
*** 9.0.0-0ubuntu0.20.04.5 500
500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
And I can see that notifications no longer get stuck in "running";
they move to "failed" when an error happens. That fixed the bug for
me.
** Tags removed: verification-ussuri-needed
** Tags added: verification-ussuri-done
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1889765
Title:
Notification processing failure causes notifications to get stuck
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Committed
Status in masakari:
Fix Released
Status in masakari package in Ubuntu:
Fix Released
Status in masakari source package in Focal:
Fix Committed
Bug description:
While testing compute node failover, I hit the following issue. Steps
to reproduce:
* disable fencing
* create a VM
* kill pacemaker-remote on the compute node running the VM, leave nova-compute running
A notification is generated, and masakari disables the compute
service. However, when it tries to evacuate the instance, it fails
because nova-compute is still running. I suppose this is expected,
since we haven't fenced the node. But then we see the following issue:
2020-07-31 09:03:39.688 6 INFO masakari.engine.manager [req-11dbb1bf-35c5-4865-a565-88e701b0771e bdc8ea78517649048f945b4c93d0fd12 523ce2afdf824732b04d7ba419bc0508 - - -] Processing notification 20b2d9f6-3446-4571-90e0-6d0441fd1e5e of type: COMPUTE_HOST
2020-07-31 09:03:40.233 6 INFO masakari.compute.nova [req-9deac789-7364-4c38-915d-8ebd7efaece2 nova - - - -] Disable nova-compute on rested-man.maas
2020-07-31 09:03:40.441 6 INFO masakari.engine.drivers.taskflow.host_failure [req-9deac789-7364-4c38-915d-8ebd7efaece2 nova - - - -] Sleeping 180 sec before starting recovery thread until nova recognizes the node down.
2020-07-31 09:06:40.460 6 INFO masakari.compute.nova [req-333cd067-12b7-4fa1-880c-5e4cf9f03d55 nova - - - -] Fetch Server list on rested-man.maas
2020-07-31 09:06:41.221 6 INFO masakari.compute.nova [req-ad6b4326-401a-44f4-8c91-926e780d6864 nova - - - -] Call get server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
2020-07-31 09:06:41.819 6 INFO masakari.compute.nova [req-18caaf81-554e-47ab-8adb-d1fe1ee1be56 nova - - - -] Call get server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
2020-07-31 09:06:42.415 6 INFO masakari.compute.nova [req-1b9fae9a-0983-4b98-a9f2-9399db65b1e2 nova - - - -] Call lock server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
2020-07-31 09:06:43.047 6 INFO masakari.compute.nova [req-1b9fae9a-0983-4b98-a9f2-9399db65b1e2 nova - - - -] Call evacuate command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898 on host None
2020-07-31 09:06:43.639 6 INFO masakari.compute.nova [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Call unlock server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
2020-07-31 09:06:44.341 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'EvacuateInstancesTask' (3e129d65-cce2-468f-869a-9372984ceee5) transitioned into state 'FAILURE' from state 'RUNNING'
4 predecessors (most recent first):
Flow 'post_tasks'
|__Flow 'main_tasks'
|__Flow 'pre_tasks'
|__Flow 'instance_evacuate_engine': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver Traceback (most recent call last):
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver File "/var/lib/kolla/venv/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver result = task.execute(**arguments)
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 387, in execute
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver _do_evacuate(self.context, host_name, instance_list)
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 368, in _do_evacuate
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver message=msg)
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver
2020-07-31 09:06:44.345 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'EvacuateInstancesTask' (3e129d65-cce2-468f-869a-9372984ceee5) transitioned into state 'REVERTED' from state 'REVERTING'
2020-07-31 09:06:44.347 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'PrepareHAEnabledInstancesTask' (e35a80ed-38ea-4866-8549-0b07c5f07db4) transitioned into state 'REVERTED' from state 'REVERTING'
2020-07-31 09:06:44.359 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'DisableComputeServiceTask' (9ef14618-9a0d-4281-b6eb-d6f83dfb5ea9) transitioned into state 'REVERTED' from state 'REVERTING'
2020-07-31 09:06:44.361 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Flow 'instance_evacuate_engine' (72f2b5a2-fcd3-4c01-b426-d5fa00f2fc01) transitioned into state 'REVERTED' from state 'RUNNING'
2020-07-31 09:06:44.361 6 ERROR masakari.engine.manager [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Failed to process notification '20b2d9f6-3446-4571-90e0-6d0441fd1e5e'. Reason: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Exception during message handling: IndexError: list index out of range
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 321, in process_notification
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server self._process_notification(context, notification)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 317, in _process_notification
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server do_process_notification(notification)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/utils.py", line 267, in inner
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 299, in do_process_notification
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server context, notification)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 248, in _handle_notification_type_host
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server tb=tb)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/utils.py", line 47, in notify_about_notification_update
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server fault, priority = _get_fault_and_priority_from_exc_and_tb(exception, tb)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/utils.py", line 31, in _get_fault_and_priority_from_exc_and_tb
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server exception, tb)
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/notifications/objects/exception.py", line 39, in from_exc_and_traceback
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server trace = inspect.trace()[-1]
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server IndexError: list index out of range
2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server
Masakari calls inspect.trace() outside of an except clause, where it
returns an empty list. It then tries to get the last item of that
list, which raises an IndexError.
This leaves the notification in a running state, since masakari does
not clean it up.
+--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
| notification_uuid | generated_time | status | type | source_host_uuid | payload |
+--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
| 20b2d9f6-3446-4571-90e0-6d0441fd1e5e | 2020-07-31T09:02:27.000000 | running | COMPUTE_HOST | 0f706778-6765-408a-86d8-08ad1e7367a6 | {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'} |
+--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
This prevents the segment host from being updated. I had to manually
delete the notification from the DB.
use masakari;
delete from notifications where notification_uuid='20b2d9f6-3446-4571-90e0-6d0441fd1e5e';
There are probably two issues to fix here:
1. Don't assume we are inside an except clause when calling inspect.trace().
2. Move the notification to a failed state if processing fails (see the sketch below).
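A minimal sketch of what the second fix could look like; the
Notification class and status strings below are illustrative
stand-ins, not masakari's actual objects:

class Notification:
    """Illustrative stand-in for masakari's notification object."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.status = "new"

    def save(self):
        # The real object persists to the masakari database.
        print(f"notification {self.uuid}: status={self.status}")

def process_notification(notification, workflow):
    notification.status = "running"
    notification.save()
    try:
        workflow()
    except Exception:
        # Record a terminal state so a failed workflow cannot leave
        # the notification stuck in "running" and block the segment host.
        notification.status = "failed"
        notification.save()
        raise
    notification.status = "finished"
    notification.save()

# Example: a workflow that raises leaves the notification "failed".
try:
    process_notification(Notification("example-uuid"), lambda: 1 / 0)
except ZeroDivisionError:
    pass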
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1889765/+subscriptions