[Bug 1749425] Re: Neutron integrated with OpenVSwitch drops packets and fails to plug/unplug interfaces from OVS on router interfaces at scale

Fri Feb 23 22:45:55 UTC 2018

Hi folks, while trying to reproduce this behaviour myself I think i've
stumbled upon some interesting behaviour. I setup a test as follows and
checked for errors at specific points. I have a 4 node setup (24
core/64G ram) with 3 gateways and 1 compute. Neutron is configured with
l3_ha enabled and max_agents_per_router set to 3;

 stage 1: created two projects with 1 router each (which gives two sets
of keepalived each with the same VR_ID (1)) and checked keepalived logs
- system load is minimal, no re-elections observed post-create.

 stage 2: scaled horizontally to 200 projects each with 1 router (giving
200 routers with VR_ID 1 each within their own network). system load is
minimal, no re-elections observed post-create, observed that all master
state routers are on the same host.

 stage 3: scaled one project vertically by creating 200 routers within
same project. As i started to get into the VR_70s i started to see some
of the extant routers get re-elected e.g. "VRRP_Instance(VR_76) Received
higher prio advert". If i run a tcpdump on one of my ha- interfaces
inside a qrouter- namespace I see a flood of "VRRPv2, Advertisement"
with each VR_ID being advertised every 2s from the current master (as
expected since that's the default interval in neutron). The consequence
of this is that neutron is frequently having to cathup with keepalived
(by running neutron-keepalived-state-change) which causes more traffic
and all without cause since there is no need for these failovers to be
occurring.

 Since the advert interval is configurable in neutron [1] I am going to
go ahead and try changing it to see of that can stop these re-elections
but that seems a little hacky as a fix so just wondering if there's
another way to mitigate these effects. I need to double check the vrrp
spec but iirc since these advertisements are sent out by the master, if
the master dies it would affect how long it takes for a re-election to
occur (spec says "(3 * Advertisement_Interval) + Skew_time") and during
that time VMs would be unreachable so maybe there's another way.

[1]
https://github.com/openstack/neutron/blob/b90ec94dc3f83f63bdb505ace1e4c272435c494b/neutron/conf/agent/l3/ha.py#L35

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/1749425

Title:
  Neutron integrated with OpenVSwitch drops packets and fails to
  plug/unplug interfaces from OVS on router interfaces at scale

Status in neutron:
  New
Status in openvswitch package in Ubuntu:
  New

Bug description:
  Description:    Ubuntu 16.04.3 LTS
  Release:        16.04
  Linux 4.4.0-96-generic on AMD64
  Neutron 2:10.0.4-0ubuntu2~cloud0 from Cloud Archive xenial-updates/ocata
  OpenVSwitch 2.6.1-0ubuntu5.2~cloud0 from Cloud Archive xenial-upates/ocata

  In an environment with three bare-metal Neutron deployments, hosting
  upward of 300 routers, with approximately the same number of
  instances, typically one router per instance, packet loss on instances
  accessed via floating IPs, including complete connectivity loss, is
  experienced. The problem is exacerbated by enabling L3HA, likely due
  to the increase in router namespaces to be scheduled and managed, and
  the additional scheduling work of bringing up keepalived and
  monitoring the keepalived VIP.

  Reducing the number of routers and rescheduling routers on new hosts,
  causing the routers to undergo a full recreation of namespace,
  iptables rules, and replugging of interfaces into OVS will correct
  packet loss or connectivity loss on impacted routers.

  On Neutron hosts in this environment, we have used systemtap to trace
  calls to kfree_skb which reveals the majority of dropped packets occur
  in the openvswitch module, notably on the br-int bridge. Inspecting
  the state of OVS shows many qtap interfaces which are no longer
  present on the Neutron host which are still plugged in to OVS.

  Diagnostic outputs in following comments.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1749425/+subscriptions