[Bug 1731595] Re: L3 HA: multiple agents are active at the same time
Corey Bryant
corey.bryant at canonical.com
Wed Jan 17 17:14:44 UTC 2018
It seems as if this bug surfaces due to load issues. While the fix
provided by Venkata (https://review.openstack.org/#/c/522641/) should
help clean things up at the time of l3 agent restart, issues seem to
come back later down the line in some circumstances. xavpaice mentioned
he saw multiple routers active at the same time when they had 464
routers configured on 3 neutron gateway hosts using L3HA, and each
router was scheduled to all 3 hosts. However, jhebden mentions that
things seem stable at the 400 L3HA router mark, and it's worth noting
this is the same deployment that xavpaice was referring to.
It seems to me that something is being pushed to its limit, and
possibly once that limit is hit, master router advertisements aren't
being received, causing a new master to be elected. If this is the case
it would be great to get to the bottom of what resource is getting
constrained.
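
To make the timing concrete, here is a rough sketch of the VRRPv2 master-down
timer from RFC 3768, which is what decides when a backup stops waiting for
advertisements and promotes itself. It assumes keepalived follows the RFC
timer; the function names are hypothetical, and the advert interval and
priority in the usage note are illustrative, not taken from this deployment.

```python
def master_down_interval(advert_int: float, priority: int) -> float:
    """Seconds a VRRPv2 backup waits without adverts before taking over.

    RFC 3768: Skew_Time = (256 - Priority) / 256
              Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in the range 1..254")
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


def backup_takes_over(seconds_since_last_advert: float,
                      advert_int: float, priority: int) -> bool:
    """True if a backup would promote itself after this much silence."""
    return seconds_since_last_advert > master_down_interval(advert_int, priority)
```

With an advert interval of 2 seconds and a priority of 50, a backup that
hears nothing for about 6.8 seconds would promote itself. If load delays or
drops advertisements for that long, a second master appears even though the
first is still running, which would match the symptom described here.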
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1731595
Title:
L3 HA: multiple agents are active at the same time
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive mitaka series:
Fix Committed
Status in Ubuntu Cloud Archive newton series:
Fix Committed
Status in Ubuntu Cloud Archive ocata series:
Fix Committed
Status in Ubuntu Cloud Archive pike series:
Fix Released
Status in Ubuntu Cloud Archive queens series:
Fix Released
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Xenial:
Fix Committed
Status in neutron source package in Zesty:
Fix Committed
Status in neutron source package in Artful:
Fix Released
Status in neutron source package in Bionic:
Fix Released
Bug description:
OS: Xenial, Ocata from Ubuntu Cloud Archive
We have three neutron-gateway hosts, with L3 HA enabled and a min of 2, max of 3. There are approx. 400 routers defined.
At some point (we weren't monitoring exactly) a number of the routers
changed from having one active agent and one or more standby agents to
having more than one active agent.
This included each of the 'active' namespaces having the same IP
addresses allocated, and therefore traffic problems reaching
instances.
Removing the routers from all but one agent, and re-adding, resolved
the issue. Restarting one l3 agent also appeared to resolve the
issue, but very slowly, to the point where we needed the system alive
again faster and reverted to removing/re-adding.
At the same time, a number of routers were listed without any active
agents at all. This situation appears to have been resolved by adding
routers to agents, after several minutes of downtime.
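
The two failure states described above (more than one active agent, and no
active agent) can be told apart with a small check like the one below. This
is only a sketch: the input shape (router ID mapped to a per-host HA state
of 'active' or 'standby') and the function name are assumptions for
illustration, and how the states are collected from the deployment is not
shown.

```python
def classify_ha_routers(states):
    """Group routers by how many agents report the 'active' HA state.

    states: dict of router_id -> dict of host -> 'active' or 'standby'
    (hypothetical shape, gathered by whatever means is available).
    """
    result = {"ok": [], "multiple_active": [], "no_active": []}
    for router_id, per_host in sorted(states.items()):
        active = sum(1 for state in per_host.values() if state == "active")
        if active == 1:
            result["ok"].append(router_id)
        elif active > 1:
            result["multiple_active"].append(router_id)
        else:
            result["no_active"].append(router_id)
    return result
```

Running a check like this periodically would also give a timestamp for when
routers first move out of the "ok" group, which we did not have here.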
I'm finding it very difficult to find relevant keepalived messages to
indicate what's going on, but what I do notice is that all the agents
have equal priority and are configured as 'backup'.
I am trying to figure out a way to reproduce this. It might be that
we need a large number of routers configured on a small number of
gateways.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1731595/+subscriptions