[Bug 1744062] Re: [SRU] L3 HA: multiple agents are active at the same time

Edward Hope-Morley edward.hope-morley at canonical.com
Wed Jul 25 10:59:39 UTC 2018


Ok, I have now completed testing the bionic-proposed keepalived package
with OpenStack Queens and am happy that it resolves the problem, i.e.
keepalived now tears down routes, vips, evips etc. when it comes back
up and transitions from master to backup. My test consisted of
deploying Queens with 3 gateways, creating 100 users/projects each
with 1 router, creating some instances with floating ips, then forcibly
killing both the keepalived and neutron-keepalived-state-change
processes associated with a particular router for which I have an
instance with a fip. I then observed that the qrouter ns interfaces for
that router were definitely unconfigured and the vrrp transition
happened as expected. This is in contrast to e.g. keepalived
1:1.2.19-1ubuntu0.2, available with all Xenial releases of OpenStack,
for which I consistently see the qrouter interfaces remain configured
on > 1 gateway.

For completeness (although it has no bearing on the keepalived fix),
I still see the other issue on bionic whereby neutron lists the router
as being active on > 1 host, e.g.

(truncating so that it will display properly)
+-//---------------------------+---------+----------------+-------+----------+
| //       id                  |   host  | admin_state_up | alive | ha_state |
+-//---------------------------+---------+----------------+-------+----------+
| //901-4edd-86fb-8dbfe7373255 | crustle |      True      |  :-)  |  active  |
| //961-4318-9743-775ebc9b0067 | chespin |      True      |  :-)  |  active  |
| //628-4c2e-8e91-c309e4477c75 |  orgen  |      True      |  :-)  | standby  |
+-//---------------------------+---------+----------------+-------+----------+
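For anyone wanting to spot this condition without eyeballing the listing, here is a minimal sketch that reads a pipe-delimited table like the one above and returns the hosts reporting ha_state "active". The function name active_hosts and the SAMPLE constant are made up for illustration and are not part of any neutron tooling; the only assumption is that the listing has "host" and "ha_state" columns as shown.

```python
# Illustrative only: parse a pipe-delimited agent listing and report which
# hosts claim to be "active" for the router. Names here are hypothetical.
SAMPLE = """
+-//---------------------------+---------+----------------+-------+----------+
| //       id                  |   host  | admin_state_up | alive | ha_state |
+-//---------------------------+---------+----------------+-------+----------+
| //901-4edd-86fb-8dbfe7373255 | crustle |      True      |  :-)  |  active  |
| //961-4318-9743-775ebc9b0067 | chespin |      True      |  :-)  |  active  |
| //628-4c2e-8e91-c309e4477c75 |  orgen  |      True      |  :-)  | standby  |
+-//---------------------------+---------+----------------+-------+----------+
"""


def active_hosts(table_text):
    """Return the hosts whose ha_state column reads 'active'."""
    header = None
    hosts = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip "+---" borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data-like row holds the column names
            continue
        row = dict(zip(header, cells))
        if row.get("ha_state") == "active":
            hosts.append(row["host"])
    return hosts


if __name__ == "__main__":
    hosts = active_hosts(SAMPLE)
    if len(hosts) > 1:
        print("router reported active on more than one host:", ", ".join(hosts))
```

More than one host in the result means neutron's view has the router active in several places, which, as explained next, does not necessarily mean the data plane is configured on more than one gateway.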

The reason for this is simple and the good news is that with the fixed
keepalived it is also benign. Neutron detects state changes by running
ip monitor on the qrouter interfaces and since my test involved killing
both neutron-keepalived-state-change (that runs ip monitor) and
keepalived, the vrrp transition appears to have happened before neutron
had ip monitor running again. Looking at the l3-agent logs I see:

2018-07-25 10:19:33.636 14018 WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid 75d24bfb-9807-4216-af4a-3aac37cf2417
2018-07-25 10:19:33.638 14018 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-75d24bfb-9807-4216-af4a-3aac37cf2417', 'keepalived', '-P', '-f', '/var/lib/neutron/ha_confs/
2018-07-25 10:19:33.886 14018 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-75d24bfb-9807-4216-af4a-3aac37cf2417', 'neutron-keepalived-state-change', '--router_id=75d24

i.e. neutron starts keepalived BEFORE neutron-keepalived-state-change,
so if the transition and teardown happen before the latter comes up and
launches ip monitor, it never sees the changes and has nothing to report
to neutron.
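The ordering problem described above can be illustrated with a toy model. This is not neutron or keepalived code: the AddressEvents class and the run function are invented for this sketch, and the only thing it assumes is what is stated above, namely that a monitor only sees address changes that occur after it has started listening.

```python
# Toy model of the race: a listener that attaches late misses earlier events.
# All names are hypothetical; nothing here is taken from neutron itself.
class AddressEvents:
    """Stand-in for address-change notifications inside the qrouter ns."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        # Only events emitted after this call reach the callback.
        self._listeners.append(callback)

    def emit(self, event):
        for callback in self._listeners:
            callback(event)


def run(monitor_first):
    """Return the events the monitor saw for a given start order."""
    events = AddressEvents()
    seen = []
    if monitor_first:
        events.subscribe(seen.append)  # monitor is up before the transition
    events.emit("vip removed")  # keepalived goes to backup and tears down
    if not monitor_first:
        events.subscribe(seen.append)  # monitor comes up too late
    return seen


if __name__ == "__main__":
    print("monitor started first:", run(monitor_first=True))
    print("monitor started last: ", run(monitor_first=False))
```

In the second ordering the monitor records nothing, which matches the behaviour seen here: the teardown really happened, but there was nothing listening at the time to report it to neutron.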

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1744062

Title:
  [SRU] L3 HA: multiple agents are active at the same time

Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in Ubuntu Cloud Archive pike series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Fix Committed
Status in neutron:
  New
Status in keepalived package in Ubuntu:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in keepalived source package in Xenial:
  Triaged
Status in neutron source package in Xenial:
  New
Status in keepalived source package in Bionic:
  Fix Committed
Status in neutron source package in Bionic:
  New

Bug description:
  [Impact]

  This is the same issue reported in
  https://bugs.launchpad.net/neutron/+bug/1731595, however that is
  marked as 'Fix Released' and the issue is still occurring and I can't
  change back to 'New' so it seems best to just open a new bug.

  It seems as if this bug surfaces due to load issues. While the fix
  provided by Venkata in https://bugs.launchpad.net/neutron/+bug/1731595
  (https://review.openstack.org/#/c/522641/) should help clean things up
  at the time of l3 agent restart, issues seem to come back later down
  the line in some circumstances. xavpaice mentioned he saw multiple
  routers active at the same time when they had 464 routers configured
  on 3 neutron gateway hosts using L3HA, and each router was scheduled
  to all 3 hosts. However, jhebden mentions that things seem stable at
  the 400 L3HA router mark, and it's worth noting this is the same
  deployment that xavpaice was referring to.

  keepalived has a patch upstream in 1.4.0 that provides a fix for
  removing left-over addresses if keepalived aborts. That patch will be
  cherry-picked to Ubuntu keepalived packages.

  [Test Case]
  The following SRU process will be followed:
  https://wiki.ubuntu.com/OpenStackUpdates

  In order to avoid regression of existing consumers, the OpenStack team
  will run their continuous integration test against the packages that
  are in -proposed. A successful run of all available tests will be
  required before the proposed packages can be let into -updates.

  The OpenStack team will be in charge of attaching the output summary
  of the executed tests. The OpenStack team members will not mark
  ‘verification-done’ until this has happened.

  [Regression Potential]
  The regression potential is lowered as the fix is cherry-picked
  without change from upstream. In order to mitigate the regression
  potential, the results of the aforementioned tests are attached to
  this bug.

  [Discussion]

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1744062/+subscriptions


