[Bug 1744062] Re: L3 HA: multiple agents are active at the same time

Corey Bryant corey.bryant at canonical.com
Tue Jul 3 13:51:38 UTC 2018


It appears the following commits are required to fix this for
keepalived:

commit e90a633c34fbe6ebbb891aa98bf29ce579b8b45c
Author: Quentin Armitage <quentin at armitage.org.uk>
Date:   Fri Dec 15 21:14:24 2017 +0000

    Fix removing left-over addresses if keepalived aborts
    
    Issue #718 reported that if keepalived terminates abnormally when
    it has vrrp instances in master state, it doesn't remove the
    left-over VIPs and eVIPs when it restarts. This is despite
    commit f4c10426c saying that it resolved this problem.
    
    It turns out that commit f4c10426c did not resolve the problem for
    VIPs and eVIPs, although it did resolve the issue for iptables and
    ipset configuration.
    
    This commit now really resolves the problem, and residual VIPs and
    eVIPs are removed at startup.
    
    Signed-off-by: Quentin Armitage <quentin at armitage.org.uk>


commit f4c10426ca0a7c3392422c22079f1b71e7d4ebe9
Author: Quentin Armitage <quentin at armitage.org.uk>
Date:   Sun Mar 6 09:53:27 2016 +0000

    Remove ip addresses left over from previous failure
    
    If keepalived terminates unexpectedly, for any instances for which
    it was master, it leaves ip addresses configured on the interfaces.
    When keepalived restarts, if it starts in backup mode, the addresses
    must be removed. In addition, any iptables/ipsets entries added for
    !accept_mode must also be removed, in order to avoid multiple entries
    being created in iptables.
    
    This commit removes any addresses and iptables/ipsets configuration
    for any interfaces that exist when keepalived starts up. If keepalived
    shut down cleanly, that will only be for non-vmac interfaces, but if
    it terminated unexpectedly, it can also be for any left-over vmacs.
    
    Signed-off-by: Quentin Armitage <quentin at armitage.org.uk>
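
As an illustration of what those two commits are guarding against, here is a
rough Python sketch of the same cleanup idea (this is not keepalived's actual
C implementation; the interface name and VIP are placeholders): on startup,
any configured VIP that is still present on an interface gets deleted before
the instance enters backup state, so two nodes never answer for the same
address.

    import subprocess

    # Placeholder data: the VIPs this node manages, keyed by interface.
    CONFIGURED_VIPS = {
        "eth0": ["192.0.2.10/24"],
    }

    def addresses_on(iface):
        """Return the IPv4 addresses currently configured on an interface."""
        out = subprocess.run(
            ["ip", "-o", "addr", "show", "dev", iface],
            capture_output=True, text=True, check=True,
        ).stdout
        addrs = []
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 4 and fields[2] == "inet":  # IPv4 only, for brevity
                addrs.append(fields[3])
        return addrs

    def remove_leftover_vips():
        """Delete any VIP left behind by an unclean shutdown."""
        for iface, vips in CONFIGURED_VIPS.items():
            for vip in set(vips) & set(addresses_on(iface)):
                subprocess.run(["ip", "addr", "del", vip, "dev", iface], check=True)

    if __name__ == "__main__":
        remove_leftover_vips()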


f4c10426ca0a7c3392422c22079f1b71e7d4ebe9 is already included in:
* keepalived 1:1.3.9-1build1 (bionic/queens, cosmic/rocky)
* keepalived 1:1.3.2-1build1 (artful/pike)
* keepalived 1:1.3.2-1 (zesty/ocata) [1]

[1] zesty is EOL -
https://launchpad.net/ubuntu/+source/keepalived/1:1.3.2-1

f4c10426ca0a7c3392422c22079f1b71e7d4ebe9 is not included in:
* keepalived 1:1.2.19-1ubuntu0.2 (xenial/mitaka)

The backport of f4c10426ca0a7c3392422c22079f1b71e7d4ebe9 to xenial does
not look trivial. I'd prefer to backport keepalived 1:1.3.2-* to the
pike/ocata cloud archives.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1744062

Title:
  L3 HA: multiple agents are active at the same time

Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in Ubuntu Cloud Archive pike series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in neutron:
  New
Status in keepalived package in Ubuntu:
  Triaged
Status in neutron package in Ubuntu:
  Triaged
Status in keepalived source package in Xenial:
  Triaged
Status in neutron source package in Xenial:
  Triaged
Status in keepalived source package in Bionic:
  Triaged
Status in neutron source package in Bionic:
  Triaged

Bug description:
  This is the same issue reported in
  https://bugs.launchpad.net/neutron/+bug/1731595; however, that bug is
  marked as 'Fix Released' while the issue is still occurring, and I
  can't change it back to 'New', so it seems best to open a new bug.

  It seems as if this bug surfaces due to load issues. While the fix
  provided by Venkata (https://review.openstack.org/#/c/522641/) should
  help clean things up at the time of l3 agent restart, issues seem to
  come back later down the line in some circumstances. xavpaice
  mentioned he saw multiple routers active at the same time when they
  had 464 routers configured on 3 neutron gateway hosts using L3HA, and
  each router was scheduled to all 3 hosts. However, jhebden mentions
  that things seem stable at the 400 L3HA router mark, and it's worth
  noting this is the same deployment that xavpaice was referring to.
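
  A quick way to spot the symptom is to ask neutron which agents currently
  report the router as active. The Python sketch below is only illustrative:
  the endpoint URL, token handling and router id are placeholders, and it
  assumes the l3-agents listing for an HA router exposes an ha_state field.

    import requests

    NEUTRON_URL = "http://neutron.example.com:9696"  # placeholder endpoint
    TOKEN = "..."                                    # placeholder keystone token

    def active_hosts(router_id):
        """Hosts whose L3 agent reports this HA router as 'active'."""
        resp = requests.get(
            f"{NEUTRON_URL}/v2.0/routers/{router_id}/l3-agents",
            headers={"X-Auth-Token": TOKEN},
        )
        resp.raise_for_status()
        agents = resp.json().get("agents", [])
        return [a["host"] for a in agents if a.get("ha_state") == "active"]

    def report(router_ids):
        for rid in router_ids:
            hosts = active_hosts(rid)
            if len(hosts) > 1:
                print(f"router {rid} is active on multiple hosts: {hosts}")

    if __name__ == "__main__":
        report(["<router-uuid>"])  # placeholder router id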

  It seems to me that something is being pushed to its limit, and
  possibly once that limit is hit, master router advertisements aren't
  being received, causing a new master to be elected. If this is the
  case, it would be great to get to the bottom of which resource is
  getting constrained.
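
  To make the timing hypothesis concrete, the sketch below computes the VRRP
  master-down interval from RFC 5798 for neutron's default 2-second
  ha_vrrp_advert_int (priorities here are only illustrative). If the master's
  keepalived is starved long enough that no advertisement goes out within
  that window, a backup promotes itself and we end up with multiple active
  routers.

    def master_down_interval(advert_int, priority):
        """Seconds of advertisement silence before a backup takes over."""
        # RFC 5798: skew shrinks as priority grows, so higher-priority
        # backups preempt sooner.
        skew = (256 - priority) * advert_int / 256.0
        return 3 * advert_int + skew

    if __name__ == "__main__":
        advert_int = 2.0  # seconds, neutron's default ha_vrrp_advert_int
        for priority in (50, 100, 200):
            print(f"priority {priority}: takeover after "
                  f"{master_down_interval(advert_int, priority):.2f}s of silence")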

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1744062/+subscriptions


