[Bug 1837635] Related fix merged to neutron (master)
OpenStack Infra
1837635 at bugs.launchpad.net
Tue Dec 17 14:17:14 UTC 2024
Reviewed: https://review.opendev.org/c/openstack/neutron/+/937758
Committed: https://opendev.org/openstack/neutron/commit/a3689956dde80b9639a3e805257ac02e1044a4c2
Submitter: "Zuul (22348)"
Branch: master
commit a3689956dde80b9639a3e805257ac02e1044a4c2
Author: yatinkarel <ykarel at redhat.com>
Date: Mon Dec 16 09:59:52 2024 +0530
Revert "[HA] Do not add initial state change delay in HA router"
This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.
The fix for bug #1945512 reintroduced bug #1837635: after
the initial backup state, an HA router can transition to the
'primary' state on multiple hosts, and as a result multiple
routers end up in the 'active' ha_state even if one of the
hosts quickly transitions to backup after the primary
state.
The issue became visible once the HA router fullstack tests
were added as part of [1].
[1] https://review.opendev.org/c/openstack/neutron/+/917429
Related-Bug: #1837635
Related-Bug: #1945512
Related-Bug: #2083609
Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1837635
Title:
HA router state change from "standby" to "master" should be delayed
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive queens series:
Fix Released
Status in Ubuntu Cloud Archive rocky series:
Fix Released
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in neutron:
Fix Released
Bug description:
Currently, when an HA state change occurs, the agent executes a series
of actions [1]: it updates the metadata proxy, updates the prefix
delegation, executes the L3 extension "ha_state_change" methods,
updates the radvd status and notifies the server of the change.
When a switch-over happens in a system with more than two routers (one
in "active" mode and the others in "standby"), the "keepalived"
process [2] on each "standby" server sets the virtual IP on the HA
interface and advertises it. If another router's HA interface has the
same priority (by default in Neutron, the HA instances of the same
router ID all have the same priority, 50) but a higher IP [3], the HA
interface of this instance has its VIPs and routes deleted and
becomes "standby" again. E.g.: [4]
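The tie-break described above can be sketched as follows. This is only an illustration, not keepalived or Neutron code: the `Instance` class, the `elect_master` function, the host names and the HA interface addresses are all hypothetical, and the sketch assumes the rule stated in the description (highest priority wins, with the higher IP address breaking a tie).

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class Instance:
    """Hypothetical model of one keepalived instance of an HA router."""
    host: str
    priority: int
    ha_ip: str  # address on the HA interface (made-up values below)


def elect_master(instances):
    """Return the instance that keeps the VIPs.

    Highest priority wins; with equal priorities the higher IP address
    wins. Addresses are compared numerically, not as strings.
    """
    return max(instances, key=lambda i: (i.priority, ip_address(i.ha_ip)))


# Both surviving instances share the default priority of 50, so the one
# with the higher HA interface address remains "master".
candidates = [
    Instance("host-a", 50, "169.254.192.5"),
    Instance("host-b", 50, "169.254.192.12"),
]
winner = elect_master(candidates)
print(winner.host)
```

Until this comparison is resolved through advertisements, both instances may briefly hold the VIPs, which is the window the rest of this description is concerned with.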
In some cases, we have detected that when the master controller is
rebooted, the change from "standby" to "master" of the other two
servers is detected, but the change from "master" to "standby" of the
server with the lower IP (as described above) is not registered by
the Neutron server, because it is still not accessible (the master
controller was rebooted). This status change is sometimes lost. The
result is that both "standby" servers become "master" but the
"master"-to-"standby" transition of one of them is lost.
1) INITIAL STATUS
(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | standby  |
| 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | standby  |
| edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+
2) CONTROLLER 1 REBOOTED
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | active   |
| 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | active   |
| edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+
The aim of this bug is to make this problem public and to propose a patch that delays the transition from "standby" to "master", so that keepalived can decide, among all the instances running on the HA servers, which one is the "master" server.
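The effect of such a delay can be sketched as below. This is not the Neutron patch, only a minimal model under stated assumptions: the function name `states_to_report`, the event format and the 2-second delay are hypothetical. A change to "master" is held back for the delay and dropped if keepalived reports another transition within that window; any other state is passed on immediately.

```python
def states_to_report(transitions, delay):
    """Filter keepalived transitions for one router instance.

    `transitions` is a time-ordered list of (timestamp_seconds, state).
    Returns the (report_time, state) pairs that would reach the server.
    """
    reported = []
    for index, (timestamp, state) in enumerate(transitions):
        if state != "master":
            # Non-master states are reported right away.
            reported.append((timestamp, state))
            continue
        has_next = index + 1 < len(transitions)
        if has_next and transitions[index + 1][0] - timestamp < delay:
            # Superseded before the delay expired: never reported.
            continue
        # Stable long enough: reported once the delay has elapsed.
        reported.append((timestamp + delay, state))
    return reported


# An instance that loses the election half a second after claiming "master".
flapping = states_to_report([(0.0, "master"), (0.5, "standby")], delay=2.0)
# An instance that wins the election and stays "master".
stable = states_to_report([(0.0, "master")], delay=2.0)
print(flapping, stable)
```

In this model the losing instance never reports "master" at all, so there is no later "master"-to-"standby" update whose loss could leave the server with two "active" routers.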
[1] https://github.com/openstack/neutron/blob/stable/stein/neutron/agent/l3/ha.py#L115-L134
[2] https://www.keepalived.org/
[3] This method is used by keepalived to define which router is predominant and must be master.
[4] http://paste.openstack.org/show/754760/