[Bug 1930361] Fix merged to masakari-monitors (stable/wallaby)
OpenStack Infra
1930361 at bugs.launchpad.net
Tue Jul 27 07:17:40 UTC 2021
Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802351
Committed: https://opendev.org/openstack/masakari-monitors/commit/9ae886e7428e61dfc6a29ec65b0f6836d2648326
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit 9ae886e7428e61dfc6a29ec65b0f6836d2648326
Author: sue <sugar-2008 at 163.com>
Date: Wed Jun 2 16:38:05 2021 +0800
Fix hostmonitor hanging forever after certain exceptions
The hostmonitor, like other Masakari monitors, starts as an
Oslo service (based on eventlet). The main thread is supposed
to run a loop that has an internal wait mechanism (instead of
reusing periodic_tasks from oslo_service). However, the loop
could be broken, if an unexpected exception appeared, and it
never ran again but the process was still alive (due to
oslo_service not stopping). The example mentioned in the bug
report is about unavailability of the Masakari API (and/or
Keystone API) before notification sending. This exception is
not caught early because SendNotification._make_client is
called outside of the try block (unlike the actual notification
sending). The exception bubbles up and stops the main loop,
leaving a useless hostmonitor process. The user is unaware
unless they notice the logs are no longer growing.
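
For illustration only (this is not code from the commit), the failure
mode described above can be sketched as follows. All names here
(ApiUnavailable, make_client, send_notification, monitor_loop,
run_unguarded) are hypothetical stand-ins, assuming only what the
message states: the client is built outside the try block, so its
exception escapes and ends the loop.

```python
class ApiUnavailable(Exception):
    """Stand-in for a failure to reach the Masakari or Keystone API."""


def make_client():
    # Hypothetical stand-in for SendNotification._make_client.
    raise ApiUnavailable("API not reachable")


def send_notification(event):
    # The client is created outside the try block, so a failure here
    # is not handled by the except clause below.
    client = make_client()
    try:
        client.notify(event)
    except Exception:
        # Only errors from the actual send would be handled here.
        pass


def monitor_loop(iterations, completed):
    # Unguarded main loop: the first escaping exception ends it.
    for _ in range(iterations):
        send_notification({"type": "host-down"})
        completed.append(True)


def run_unguarded(iterations=3):
    completed = []
    try:
        monitor_loop(iterations, completed)
    except ApiUnavailable:
        # In the real service the process would stay alive at this
        # point, with the loop no longer running.
        pass
    return len(completed)
```

In this sketch run_unguarded() reports that no iteration completed,
because the very first client creation already broke the loop.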
While the general design begs for a revamp (we might get away
with that by using Consul in the first place), the easy fix is
to prevent exceptions breaking the loop completely so that the
hostmonitor can continue to work and try to regain health.
At the very least it will keep posting ERROR messages in the log
which is more likely to be spotted in comparison to lack of logs
(which is, unfortunately, less commonly considered an alerting
situation).
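
A minimal sketch of the idea, not the actual patch: wrap each
iteration of the loop so that an unexpected exception is logged and
the next iteration still runs. The names run_guarded_loop, check_hosts
and wait are assumptions made for this example and do not come from
masakari-monitors.

```python
import logging

LOG = logging.getLogger("hostmonitor-sketch")


def run_guarded_loop(check_hosts, iterations, wait=lambda: None):
    """Call check_hosts repeatedly; log any exception and keep going.

    Returns the number of iterations that raised an exception.
    """
    failures = 0
    for _ in range(iterations):
        try:
            check_hosts()
        except Exception:
            # Keep the loop alive and leave an ERROR with traceback
            # in the log instead of going silent.
            LOG.exception("Host check failed; will retry")
            failures += 1
        # Hypothetical stand-in for the loop's internal wait mechanism.
        wait()
    return failures
```

With a check that fails twice and then recovers, the loop still runs
every iteration and only the two failures are logged.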
This change also fixes, adapts and robustifies the two relevant
unit tests.
Closes-Bug: #1930361
Co-Authored-By: Radosław Piliszek <radoslaw.piliszek at gmail.com>
Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
(cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to masakari-monitors in Ubuntu.
https://bugs.launchpad.net/bugs/1930361
Title:
hostmonitor hangs after notifications send failed
Status in masakari-monitors:
Fix Released
Status in masakari-monitors ussuri series:
Fix Committed
Status in masakari-monitors victoria series:
Fix Committed
Status in masakari-monitors wallaby series:
Fix Committed
Status in masakari-monitors xena series:
Fix Released
Status in masakari-monitors package in Ubuntu:
Confirmed
Bug description:
In one environment, we found that a hostmonitor stopped logging after
it failed to send a host failure notification.
I noticed that monitor_hosts exits as soon as it catches an
exception. So there is a risk that, if a host goes down later, no
recovery will be triggered.
See comment #5 for a detailed analysis.
To manage notifications about this bug go to:
https://bugs.launchpad.net/masakari-monitors/+bug/1930361/+subscriptions