[Bug 1889765] Re: Notification processing failure causes notifications to get stuck

Łukasz Zemczak 1889765 at bugs.launchpad.net
Fri Jun 11 16:38:57 UTC 2021


Hello Mark, or anyone else affected,

Accepted masakari into focal-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/masakari/9.0.0-0ubuntu0.20.04.5 in
a few hours, and then in the -proposed repository.

Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us in getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this
bug, mentioning the version of the package you tested and what testing
has been performed on the package, and change the tag from
verification-needed-focal to verification-done-focal. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-focal. In either case, without details of
your testing we will not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification. Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: masakari (Ubuntu Focal)
       Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-focal

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to masakari in Ubuntu.
https://bugs.launchpad.net/bugs/1889765

Title:
  Notification processing failure causes notifications to get stuck

Status in masakari:
  Fix Released
Status in masakari package in Ubuntu:
  Fix Released
Status in masakari source package in Focal:
  Fix Committed

Bug description:
  While testing compute node failover, I hit the following issue. Steps
  to reproduce:

  * disable fencing
  * create a VM
  * kill pacemaker-remote on the compute node running the VM, leave nova-compute running

  A notification is generated, and masakari disables the compute
  service. However, when it tries to evacuate the instance, the
  evacuation fails because nova-compute is still running. I suppose
  this is expected, since we haven't fenced the node. But we then hit
  the following issue:

  2020-07-31 09:03:39.688 6 INFO masakari.engine.manager [req-11dbb1bf-35c5-4865-a565-88e701b0771e bdc8ea78517649048f945b4c93d0fd12 523ce2afdf824732b04d7ba419bc0508 - - -] Processing notification 20b2d9f6-3446-4571-90e0-6d0441fd1e5e of type: COMPUTE_HOST
  2020-07-31 09:03:40.233 6 INFO masakari.compute.nova [req-9deac789-7364-4c38-915d-8ebd7efaece2 nova - - - -] Disable nova-compute on rested-man.maas
  2020-07-31 09:03:40.441 6 INFO masakari.engine.drivers.taskflow.host_failure [req-9deac789-7364-4c38-915d-8ebd7efaece2 nova - - - -] Sleeping 180 sec before starting recovery thread until nova recognizes the node down.
  2020-07-31 09:06:40.460 6 INFO masakari.compute.nova [req-333cd067-12b7-4fa1-880c-5e4cf9f03d55 nova - - - -] Fetch Server list on rested-man.maas
  2020-07-31 09:06:41.221 6 INFO masakari.compute.nova [req-ad6b4326-401a-44f4-8c91-926e780d6864 nova - - - -] Call get server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:41.819 6 INFO masakari.compute.nova [req-18caaf81-554e-47ab-8adb-d1fe1ee1be56 nova - - - -] Call get server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:42.415 6 INFO masakari.compute.nova [req-1b9fae9a-0983-4b98-a9f2-9399db65b1e2 nova - - - -] Call lock server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:43.047 6 INFO masakari.compute.nova [req-1b9fae9a-0983-4b98-a9f2-9399db65b1e2 nova - - - -] Call evacuate command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898 on host None
  2020-07-31 09:06:43.639 6 INFO masakari.compute.nova [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Call unlock server command for instance 671f4a8d-af0d-4797-bdb3-d95ee0269898
  2020-07-31 09:06:44.341 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'EvacuateInstancesTask' (3e129d65-cce2-468f-869a-9372984ceee5) transitioned into state 'FAILURE' from state 'RUNNING'
  4 predecessors (most recent first):
    Flow 'post_tasks'
    |__Flow 'main_tasks'
       |__Flow 'pre_tasks'
          |__Flow 'instance_evacuate_engine': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver Traceback (most recent call last):
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver   File "/var/lib/kolla/venv/lib/python3.6/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver     result = task.execute(**arguments)
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 387, in execute
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver     _do_evacuate(self.context, host_name, instance_list)
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 368, in _do_evacuate
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver     message=msg)
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
  2020-07-31 09:06:44.341 6 ERROR masakari.engine.drivers.taskflow.driver 
  2020-07-31 09:06:44.345 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'EvacuateInstancesTask' (3e129d65-cce2-468f-869a-9372984ceee5) transitioned into state 'REVERTED' from state 'REVERTING'
  2020-07-31 09:06:44.347 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'PrepareHAEnabledInstancesTask' (e35a80ed-38ea-4866-8549-0b07c5f07db4) transitioned into state 'REVERTED' from state 'REVERTING'
  2020-07-31 09:06:44.359 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Task 'DisableComputeServiceTask' (9ef14618-9a0d-4281-b6eb-d6f83dfb5ea9) transitioned into state 'REVERTED' from state 'REVERTING'
  2020-07-31 09:06:44.361 6 WARNING masakari.engine.drivers.taskflow.driver [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Flow 'instance_evacuate_engine' (72f2b5a2-fcd3-4c01-b426-d5fa00f2fc01) transitioned into state 'REVERTED' from state 'RUNNING'
  2020-07-31 09:06:44.361 6 ERROR masakari.engine.manager [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Failed to process notification '20b2d9f6-3446-4571-90e0-6d0441fd1e5e'. Reason: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '671f4a8d-af0d-4797-bdb3-d95ee0269898' from host 'rested-man.maas'
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server [req-0ff038ca-d59a-4be9-aef9-6d04247f9e76 nova - - - -] Exception during message handling: IndexError: list index out of range
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 321, in process_notification
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     self._process_notification(context, notification)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 317, in _process_notification
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     do_process_notification(notification)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/utils.py", line 267, in inner
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 299, in do_process_notification
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     context, notification)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/manager.py", line 248, in _handle_notification_type_host
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     tb=tb)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/utils.py", line 47, in notify_about_notification_update
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     fault, priority = _get_fault_and_priority_from_exc_and_tb(exception, tb)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/engine/utils.py", line 31, in _get_fault_and_priority_from_exc_and_tb
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     exception, tb)
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/masakari/notifications/objects/exception.py", line 39, in from_exc_and_traceback
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server     trace = inspect.trace()[-1]
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server IndexError: list index out of range
  2020-07-31 09:06:44.361 6 ERROR oslo_messaging.rpc.server 

  Masakari calls inspect.trace() outside of an except clause, which
  returns an empty list. It then tries to take the last item of that
  empty list, which raises an IndexError.
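
  That behaviour is easy to reproduce with nothing but the standard
  library (a minimal sketch, independent of masakari):

    import inspect

    def working():
        try:
            raise ValueError("boom")
        except ValueError:
            # An exception is being handled here, so inspect.trace()
            # returns at least one FrameInfo record.
            return inspect.trace()[-1]

    def broken():
        # No exception is being handled, so inspect.trace() returns [].
        return inspect.trace()[-1]  # IndexError: list index out of range

    working()  # returns the frame that raised
    broken()   # raises IndexError, exactly as in the log above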

  This leaves the notification in a running state, since masakari does
  not clean it up.

  +--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
  | notification_uuid                    | generated_time             | status  | type         | source_host_uuid                     | payload                                                                    |
  +--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+
  | 20b2d9f6-3446-4571-90e0-6d0441fd1e5e | 2020-07-31T09:02:27.000000 | running | COMPUTE_HOST | 0f706778-6765-408a-86d8-08ad1e7367a6 | {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'} |
  +--------------------------------------+----------------------------+---------+--------------+--------------------------------------+----------------------------------------------------------------------------+

  This prevents the segment host from being updated. I had to manually
  delete the notification from the DB:

  use masakari
  delete from notifications where notification_uuid='20b2d9f6-3446-4571-90e0-6d0441fd1e5e';

  There are probably two issues here (a sketch of both fixes follows
  the list):

  1. don't assume we are in an except clause when calling
     inspect.trace()
  2. move the notification to error state if processing fails
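
  What a fix for both could roughly look like (a sketch only: the
  _do_process helper, the attribute names and the 'failed' status
  value are illustrative assumptions, not masakari's actual code):

    import inspect

    def _get_trace_info():
        # Fix 1: tolerate being called with no exception in flight.
        frames = inspect.trace()
        if not frames:
            # Not inside an except clause: return placeholders instead
            # of indexing into an empty list.
            return 'unknown', 'unknown'
        frame_info = frames[-1]
        module = inspect.getmodule(frame_info[0])
        module_name = module.__name__ if module else 'unknown'
        return module_name, frame_info[3]  # [3] is the function name

    def process_notification(context, notification):
        # Fix 2: never leave the notification stuck in 'running'.
        try:
            _do_process(context, notification)  # hypothetical helper
        except Exception:
            notification.status = 'failed'  # illustrative status value
            notification.save()
            raise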

To manage notifications about this bug go to:
https://bugs.launchpad.net/masakari/+bug/1889765/+subscriptions


