[Bug 1837635] Re: HA router state change from "standby" to "master" should be delayed
OpenStack Infra
1837635 at bugs.launchpad.net
Wed Apr 22 22:32:54 UTC 2020
Reviewed: https://review.opendev.org/721243
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2d849c6fee4fdf14e0ecc5242f6c9cc12aae8cbc
Submitter: Zuul
Branch: stable/rocky
commit 2d849c6fee4fdf14e0ecc5242f6c9cc12aae8cbc
Author: Rodolfo Alonso Hernandez <ralonsoh at redhat.com>
Date: Wed Jul 24 11:17:19 2019 +0000
Refactor the L3 agent batch notifier
This patch is the first in a series of patches improving how the L3
agents report the router HA state to the Neutron server.
This patch partially reverts the previous patch [1]. When the batch
notifier sends events, it calls the callback method passed during
initialization, in this case AgentMixin.notify_server. The batch
notifier spawns a new thread in charge of sending the notifications and
then waits the specified "batch_interval" time. If the callback method
is not synchronous with the notify thread execution (which is what [1]
implemented), the thread can finish while the RPC client is still
sending the HA router states. If another HA state update is received,
both updates can be executed at the same time, so a new router state
can be overwritten by an old one that has not yet been sent or
processed.
The batch notifier is refactored to improve what was initially
implemented in [2] and then updated in [3]. Currently, each new event
thread can update the "pending_events" list; a new thread is then
spawned to process this event list. This thread decouples the
processing from the calling thread, making event processing
non-blocking.
But with the current implementation, each new event spawns a new
thread, synchronized with the previous and following ones (using a
synchronized decorator). That means that during the batch interval the
system can have as many waiting threads as events received; those
threads finish sequentially, each one ending when the previous thread
completes its batch-interval sleep.
Instead, this patch enqueues each new event and allows only one thread
to be alive while processing the event list. If new events have been
queued by the end of the processing loop, the same thread processes
them.
[1] I3f555a0c78fbc02d8214f12b62c37d140bc71da1
[2] I2f8cf261f48bdb632ac0bd643a337290b5297fce
[3] I82f403441564955345f47877151e0c457712dd2f
Partial-Bug: #1837635
Change-Id: I20cfa1cf5281198079f5e0dbf195755abc919581
(cherry picked from commit 8b7d2c8a93fdf69a828f14bd527d8f132b27bc6e)
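For readers unfamiliar with the notifier, the single-worker pattern the
commit message describes boils down to roughly the following minimal,
threading-based sketch. This is illustrative only: the real Neutron
BatchNotifier is eventlet-based and differs in detail, and all names
here are placeholders.

import threading
import time


class BatchNotifierSketch(object):
    """Queue events and let a single worker thread drain them in batches."""

    def __init__(self, batch_interval, callback):
        self.batch_interval = batch_interval
        self.callback = callback          # e.g. AgentMixin.notify_server
        self.pending_events = []
        self._lock = threading.Lock()
        self._running = False

    def queue_event(self, event):
        """Enqueue an event; start the worker only if none is alive."""
        with self._lock:
            self.pending_events.append(event)
            if not self._running:
                self._running = True
                threading.Thread(target=self._process).start()

    def _process(self):
        while True:
            time.sleep(self.batch_interval)
            with self._lock:
                batch, self.pending_events = self.pending_events, []
            if batch:
                # The callback runs inside this same thread, so a later
                # batch can never overtake one that is still being sent.
                self.callback(batch)
            with self._lock:
                if not self.pending_events:
                    # Nothing new arrived while sending: stop the worker.
                    self._running = False
                    return

Events queued during the sleep or while the callback is running are
picked up by the same worker on its next iteration, so at most one
notification batch is ever in flight.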
** Tags added: in-stable-rocky
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1837635
Title:
HA router state change from "standby" to "master" should be delayed
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive queens series:
In Progress
Status in Ubuntu Cloud Archive rocky series:
Fix Committed
Status in Ubuntu Cloud Archive stein series:
New
Status in neutron:
Fix Released
Bug description:
Currently, when an HA state change occurs, the agent executes a series
of actions [1]: it updates the metadata proxy, updates the prefix
delegation, runs the L3 extensions' "ha_state_change" methods, updates
the radvd status and notifies the server.
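For context, that sequence looks roughly like the following outline;
the method names are placeholders, not the exact ones used in
neutron/agent/l3/ha.py.

# Rough, illustrative outline of the per-router work on a HA state
# transition; names below are placeholders, not the actual agent API.
def handle_ha_state_change(agent, router_id, state):
    router_info = agent.router_info[router_id]
    agent.update_metadata_proxy(router_info, state)      # (re)spawn or stop the proxy
    agent.update_prefix_delegation(router_info, state)   # IPv6 prefix delegation
    agent.run_l3_extensions_ha_state_change(router_id, state)
    agent.update_radvd(router_info, state)                # radvd only on the master
    agent.notify_server(router_id, state)                 # batched RPC to neutron-server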
When a switch-over is done in a system with more than two routers (one
in "active" mode and the others in "standby"), the "keepalived" process
[2] on each "standby" server will set the virtual IP on the HA
interface and advertise it. If another router's HA interface has the
same priority (by default in Neutron, the HA instances of the same
router ID all have priority 50) but a higher IP [3], the HA interface
of this instance will have its VIPs and routes deleted and the instance
will become "standby" again. E.g.: [4]
In some cases we have detected that, when the master controller is
rebooted, the change from "standby" to "master" on the other two
servers is detected, but the change from "master" to "standby" on the
server with the lower IP (as described above) is not registered by the
Neutron server, because the Neutron server is not yet reachable (the
master controller was rebooted). This state change is sometimes lost,
leaving both "standby" servers reported as "master" while the
"master"-to-"standby" transition of one of them is missing.
1) INITIAL STATUS
(overcloud) [stack at undercloud-0 ~]$ neutron l3-agent-list-hosting-router router
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True | :-) | standby |
| 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True | :-) | standby |
| edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True | :-) | active |
+--------------------------------------+--------------------------+----------------+-------+----------+
2) CONTROLLER 1 REBOOTED
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True | :-) | active |
| 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True | :-) | active |
| edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True | :-) | standby |
+--------------------------------------+--------------------------+----------------+-------+----------+
The aim of this bug is to make this problem public and to propose a patch that delays the transition from "standby" to "master", letting keepalived decide, among all the instances running on the HA servers, which one becomes the "master".
[1] https://github.com/openstack/neutron/blob/stable/stein/neutron/agent/l3/ha.py#L115-L134
[2] https://www.keepalived.org/
[3] keepalived uses the higher IP address as a tie-breaker to decide which instance takes precedence and becomes master when priorities are equal.
[4] http://paste.openstack.org/show/754760/
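One possible shape of the delay proposed above (a minimal sketch only,
not the actual Neutron patch; the report callback and the delay value
are assumptions): hold a "standby"-to-"master" report for a grace
period and drop it if keepalived demotes the router again in the
meantime.

import threading

class DelayedHaStateReporter(object):
    """Report "master" transitions only after a grace period has passed
    without a conflicting transition (sketch; delay value is assumed)."""

    def __init__(self, report_cb, delay=10):
        self.report_cb = report_cb   # e.g. the batched notification to neutron-server
        self.delay = delay           # grace period in seconds (illustrative default)
        self._pending = {}

    def state_change(self, router_id, state):
        timer = self._pending.pop(router_id, None)
        if timer:
            # A newer transition supersedes the one still waiting.
            timer.cancel()
        if state == 'master':
            # Give keepalived time to finish the election before reporting.
            timer = threading.Timer(self.delay, self.report_cb,
                                    args=(router_id, state))
            self._pending[router_id] = timer
            timer.start()
        else:
            # Demotions are reported immediately.
            self.report_cb(router_id, state)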
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1837635/+subscriptions