[Bug 1878548] Re: There are cases when masakari-hostmonitor will recognize online nodes as offline and send (in)appropriate notifications to Masakari
Edward Hope-Morley
1878548 at bugs.launchpad.net
Wed Dec 6 12:29:42 UTC 2023
Verified focal-proposed with the following output:
$ apt-cache policy masakari-monitors-common
masakari-monitors-common:
Installed: 9.0.0-0ubuntu0.20.04.2
Candidate: 9.0.0-0ubuntu0.20.04.2
Version table:
*** 9.0.0-0ubuntu0.20.04.2 500
500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
100 /var/lib/dpkg/status
9.0.0-0ubuntu0.20.04.1 500
500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
9.0.0~b3~git2020041013.e225e6d-0ubuntu1 500
500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
Tested fencing by blocking communication between a compute host and all
masakari/corosync units, e.g.:
# do this on all masakari units
juju run -a masakari -- sudo iptables -I INPUT -p tcp -s 10.0.0.10 --sport 3121 -j REJECT
# Then wait for compute host 10.0.0.10 to be rebooted, which it was. No notification was sent, but that feels unrelated to this patch/bug.
$ openstack notification list
# From below we confirm that only compute0 was rebooted
$ juju ssh nova-compute/0 uptime
12:20:32 up 24 min, 1 user, load average: 0.65, 0.82, 1.00
Connection to 10.0.0.10 closed.
$ juju ssh nova-compute/1 uptime
12:20:37 up 2:23, 1 user, load average: 0.70, 0.85, 1.02
Connection to 10.0.0.47 closed.
$ juju ssh nova-compute/2 uptime
12:20:41 up 2:23, 1 user, load average: 0.64, 0.71, 0.89
Connection to 10.0.0.31 closed.
** Tags removed: verification-needed verification-needed-focal
** Tags added: verification-done verification-done-focal
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1878548
Title:
There are cases when masakari-hostmonitor will recognize online nodes
as offline and send (in)appropriate notifications to Masakari
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Committed
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in Ubuntu Cloud Archive wallaby series:
Fix Released
Status in masakari-monitors:
Fix Released
Status in masakari-monitors ussuri series:
Fix Released
Status in masakari-monitors victoria series:
Fix Released
Status in masakari-monitors wallaby series:
Fix Released
Status in masakari-monitors xena series:
Fix Released
Status in masakari-monitors package in Ubuntu:
Fix Released
Status in masakari-monitors source package in Focal:
Fix Committed
Bug description:
[Issue]
ComputeNodes are managed by pacemaker_remote in my environment.
When one ComputeNode is isolated in the network, the masakari-hostmonitors on the other ComputeNodes send a failure notification about the isolated ComputeNode to masakari-api.
At the same time, the masakari-hostmonitor on the isolated ComputeNode recognizes the other ComputeNodes as offline, so it sends failure notifications about ComputeNodes that are actually online.
As a result, masakari-engine runs the recovery procedure on online ComputeNodes.
[Cause]
The current masakari-hostmonitor can't determine whether or not it is isolated in the network if ComputeNodes are managed by pacemaker_remote.
With pacemaker (not remote), masakari-hostmonitor waits until it is killed if it is isolated in the network. This is implemented in the following code:
<https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/hostmonitor/host_handler/handle_host.py#L398-L402>
But masakari-hostmonitor with pacemaker_remote does not determine whether it is isolated:
<https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/hostmonitor/host_handler/handle_host.py#L93-L95>
[Solution]
A ComputeNode managed by pacemaker_remote should recognize itself as offline when it is isolated.
The state monitoring process should be skipped in that case.
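The proposed behaviour can be sketched roughly as follows. This is only an illustration, not the actual masakari-monitors patch: the function names, the "online"/"offline" strings and the idea of passing the cluster view in as a plain dict are all assumptions made for this example. The point is that a node which does not see itself as online skips the state monitoring and reports nothing about its peers.

```python
from typing import Dict, List

# Hypothetical status strings; the real monitor parses cluster status output.
ONLINE = "online"
OFFLINE = "offline"


def is_self_isolated(local_host: str, statuses: Dict[str, str]) -> bool:
    """Return True when the local node is missing from the cluster view
    or is not reported as online in it (assumed sign of isolation)."""
    return statuses.get(local_host) != ONLINE


def hosts_to_report(local_host: str, statuses: Dict[str, str]) -> List[str]:
    """Return the peers that should get a failure notification.

    If the local node looks isolated, skip the state monitoring entirely
    and report nothing, since its view of the other nodes is unreliable.
    """
    if is_self_isolated(local_host, statuses):
        return []
    return sorted(
        host
        for host, status in statuses.items()
        if host != local_host and status == OFFLINE
    )
```

With a healthy local node, only the peers seen as offline are returned. When the local node is itself offline in (or absent from) the view, the result is empty, so no notifications about online ComputeNodes would be sent.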
See comment #11 for how yoctozepto managed to reproduce something
similar to what is described.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1878548/+subscriptions