[Bug 1889765] Re: Notification processing failure causes notifications to get stuck

Pedro Guimarães 1889765 at bugs.launchpad.net
Thu Jun 17 13:57:16 UTC 2021


Hi @james-page, I've run the package from focal-proposed:

 *** 9.0.0-0ubuntu0.20.04.5 500
        500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages

I can confirm that notifications no longer get stuck in "running";
they now move to "failed" when an error happens. That fixes the bug
for me.

** Tags removed: verification-ussuri-needed
** Tags added: verification-ussuri-done

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1889765

Title:
  Notification processing failure causes notifications to get stuck

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in masakari:
  Fix Released
Status in masakari package in Ubuntu:
  Fix Released
Status in masakari source package in Focal:
  Fix Committed

Bug description:
  While testing compute node failover, I hit the following issue. Steps
  to reproduce:

  * disable fencing
  * create a VM
  * kill pacemaker-remote on the compute node running the VM, leave nova-compute running

  A notification is generated, and masakari disables the compute
  service. However, when it tries to evacuate the instance, it fails
  because nova-compute is still running. I suppose this is expected,
  since we haven't fenced the node. However, we see the following issue:

  2020-07-31 09:03:39.688 6 INFO masakari.engine.manager [req-11dbb1bf-35c5-4865-a565-88e701b0771e bdc8ea78517649048f945b4c93d0fd12 523ce2afdf824732b04d7ba419bc0508 - - -] Processing notification 20b2d9f6-3446-4571-90e0-6d0441fd1e5e of type: COMPUTE_HOST
  2020-07-31 09:03:40.233 6 INFO masakari.compute.nova [req-9deac789-7364-4c38-915d-8ebd7efaece2 nova - - - -] Disable nova-compute on rested-man.maas
  2020-07-31 09:03:40.441 6 INFO masakari.engine.drivers.taskflow.host_failure [req-9deac789-7364-4c38-915d-8ebd7efaece2 nova - - - -] Sleeping 180 sec before starting recovery thread until nova recognizes the node down.
  2020-07-31 09:06:40.460 6 INFO masakari.compute.nova [req-333cd067-12b7-4fa1-880c-5e4cf9f03d55 nova - - - -] Fetch Server list on rested-man.maas
  2020-07-31 09:06:41.221 6 INFO masakari.compute.nova [req-ad6b4326-401a-44f4-8c91-926e780d6864 nova - - - -] Call get server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:41.819 6 INFO masakari.compute.nova [req-18caaf81-554e-47ab-8adb-d1fe1ee1be56 nova - - - -] Call get server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:42.415 6 INFO masakari.compute.nova [req-1b9fae9a-0983-4b98-a9f2-9399db65b1e2 nova - - - -] Call lock server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:43.047 6 INFO masakari.compute.nova [req-1b9fae9a-0983-4b98-a9f2-9399db65b1e2 nova - - - -] Call evacuate command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898 on host None
  2020-07-31 09:06:43.639 6 INFO masakari.compute.nova [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Call unlock server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:44.341 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'EvacuateInstancesTask' (3e129d65-cce2-468f-869a-9372984ceee5) transitioned into state 'FAILURE' from state 'RUNNING'
  4 predecessors (most recent first):
    Flow 'post_tasks'
    |__Flow 'main_tasks'
       |__Flow 'pre_tasks'
          |__Flow 'instance_evacuate_engine': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver Traceback (most recent call last):
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver   File "/var/lib/kolla/venv/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver     result = task.execute(**arguments)
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 387, in execute
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver     _do_evacuate(self.context, host_name, instance_list)
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 368, in _do_evacuate
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver     message=msg)
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver 
  2020-07-31 09:06:44.345 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'EvacuateInstancesTask' (3e129d65-cce2-468f-869a-9372984ceee5) transitioned into state 'REVERTED' from state 'REVERTING'
  2020-07-31 09:06:44.347 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'PrepareHAEnabledInstancesTask' (e35a80ed-38ea-4866-8549-0b07c5f07db4) transitioned into state 'REVERTED' from state 'REVERTING'
  2020-07-31 09:06:44.359 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'DisableComputeServiceTask' (9ef14618-9a0d-4281-b6eb-d6f83dfb5ea9) transitioned into state 'REVERTED' from state 'REVERTING'
  2020-07-31 09:06:44.361 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Flow 'instance_evacuate_engine' (72f2b5a2-fcd3-4c01-b426-d5fa00f2fc01) transitioned into state 'REVERTED' from state 'RUNNING'
  2020-07-31 09:06:44.361 6 ERROR masakari.engine.manager [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Failed to process notification '20b2d9f6-3446-4571-90e0-6d0441fd1e5e'. Reason: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Exception during message handling: IndexError: list index out of range
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 321, in process_notification
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     self._process_notification(context, notification)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 317, in _process_notification
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     do_process_notification(notification)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/utils.py", line 267, in inner
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 299, in do_process_notification
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     context, notification)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 248, in _handle_notification_type_host
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     tb=tb)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/utils.py", line 47, in notify_about_notification_update
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     fault, priority = _get_fault_and_priority_from_exc_and_tb(exception, tb)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/utils.py", line 31, in _get_fault_and_priority_from_exc_and_tb
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     exception, tb)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/notifications/objects/exception.py", line 39, in from_exc_and_traceback
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     trace = inspect.trace()[-1]
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server IndexError: list index out of range
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server 

  Masakari calls inspect.trace() outside of an except clause, where it
  returns an empty list. It then tries to take the last element of that
  empty list, which raises an IndexError.
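
  A minimal standalone demonstration of the underlying Python behaviour
  (plain Python, not masakari code): inspect.trace() only returns frames
  while an exception is actually being handled, so indexing its last
  element anywhere else raises exactly this IndexError.

  import inspect

  def newest_trace_entry():
      # inspect.trace() lists the frames of the exception currently being
      # handled; with no active exception it returns an empty list.
      return inspect.trace()[-1]

  try:
      raise RuntimeError("boom")
  except RuntimeError:
      print(newest_trace_entry().function)  # OK: an exception is being handled

  newest_trace_entry()  # IndexError: list index out of range, as in the log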

  This leaves the notification in a running state, since masakari does
  not clean it up.

  +--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
  | notification_uuid                    | generated_time             | status  | type         | source_host_uuid                     | payload                                                                    |
  +--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
  | 20b2d9f6-3446-4571-90e0-6d0441fd1e5e | 2020-07-31T09:02:27.000000 | running | COMPUTE_HOST | 0f706778-6765-408a-86d8-08ad1e7367a6 | {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'} |
  +--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+

  This prevents the segment host from being updated. I had to manually
  delete the notification from the DB.

  use masakari
  delete from notifications where notification_uuid='20b2d9f6-3446-4571-90e0-6d0441fd1e5e';

  There are probably two issues here (see the sketch below the list).

  1. don't assume we are in an except clause
  2. move the notification to error state if processing fails
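
  A rough defensive sketch of the first point (a hypothetical helper,
  not the actual upstream patch or masakari's real API): build the fault
  details from the exception and traceback that were passed in, and only
  consult inspect.trace() when an exception is really being handled.

  import inspect

  def fault_from_exc_and_tb(exception, tb):
      # Hypothetical helper: derive fault details from the arguments
      # rather than assuming an except clause is active.
      trace = inspect.trace()
      if trace:
          frame_info = trace[-1]
          module = inspect.getmodule(frame_info.frame)
          module_name = module.__name__ if module else 'unknown'
          function_name = frame_info.function
      else:
          # Outside an except clause: fall back to what the caller gave us.
          module_name = exception.__class__.__module__
          function_name = 'unknown'
      return {
          'exception': exception.__class__.__name__,
          'exception_message': str(exception),
          'module_name': module_name,
          'function_name': function_name,
          'traceback': tb,
      }

  For the second point, the engine's error path would also need to move
  the notification out of "running" (e.g. to "failed"), which matches
  the behaviour verified above with 9.0.0-0ubuntu0.20.04.5.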

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1889765/+subscriptions


