[Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Trent Lloyd
1869808 at bugs.launchpad.net
Tue Mar 16 02:00:34 UTC 2021
Looking to get this approved so that we can verify it, as we ideally
need this released by the weekend of March 27th for some maintenance
activity. Is something holding back the approval?
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1869808
Title:
reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive queens series:
Triaged
Status in Ubuntu Cloud Archive rocky series:
Fix Committed
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Released
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Bionic:
In Progress
Status in neutron source package in Focal:
Fix Released
Status in neutron source package in Groovy:
Fix Released
Status in neutron source package in Hirsute:
Fix Released
Bug description:
(SRU template copied from comment 42)
[Impact]
- When there is a RabbitMQ or neutron-api outage, the neutron-
openvswitch-agent undergoes a "resync" process and temporarily blocks
all VM traffic. This always happens for a short period (maybe <1
second), but in some high-scale environments it lasts for minutes. If
RabbitMQ is down again during the resync, traffic is also blocked
until the agent can reconnect, which may take a long time. This also
affects situations where neutron-openvswitch-agent is intentionally
restarted while RabbitMQ is down (see the sketch after this list).
Bug #1869808 addresses this issue and Bug #1887148 is a fix for that
fix to prevent network loops during DVR startup.
- In the same situation, the neutron-l3-agent can delete the L3 router
(Bug #1871850)
[Test Case]
(1) Deploy OpenStack Bionic-Queens with DVR and a *VLAN* tenant
network (VXLAN or FLAT will not reproduce the issue). With a standard
deployment, simply enabling DHCP on the ext_net subnet will allow VMs
to be booted directly on the ext_net provider network: run "openstack
subnet set --dhcp ext_net" and then deploy the VM directly to
ext_net.
(2) Deploy a VM to the VLAN network
(3) Start pinging the VM from an external network (a measurement
helper sketch follows these steps)
(4) Stop all RabbitMQ servers
(5) Restart neutron-openvswitch-agent
(6) Ping traffic should cease and not recover
(7) Start all RabbitMQ servers
(8) Ping traffic will recover after 30-60 seconds
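To make the outage in step (6) and the recovery in step (8) easier to
measure, a rough Python helper like the following can stand in for a
bare ping (the address is the example VM IP from this report):

#!/usr/bin/env python3
"""Ping the VM once a second and report how long any outage lasts."""
import subprocess
import time

VM_IP = "172.31.10.4"  # example address from this report


def ping_once(ip, timeout=1):
    # One ICMP echo; returncode 0 means a reply arrived in time.
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0


outage_started = None
while True:
    if not ping_once(VM_IP):
        if outage_started is None:
            outage_started = time.time()
            print("outage started")
    elif outage_started is not None:
        print("outage lasted %.1fs" % (time.time() - outage_started))
        outage_started = None
    time.sleep(1)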
[Where problems could occur]
These patches are all cherry-picked from the upstream stable
branches and have existed upstream, including on the stable/queens
branch, for many months. In Ubuntu, all supported subsequent releases
(Stein onwards) have also carried these patches for many months, with
the exception of Queens.
There is a chance that not installing these drop flows during
startup could let traffic go somewhere unexpected while the network
is in a partially set-up state. This was the case for DVR: in setups
with more than one DVR external network port, a network loop could
temporarily be created. That was already addressed with the included
patch for Bug #1887148. We checked and could not locate any other
merged changes to this drop_port logic that would also need to be
backported.
[Other Info]
[original description]
We are using OpenStack Neutron 13.0.6, deployed using
OpenStack-Helm.
I tested pinging servers in the same VLAN while restarting the
neutron-ovs-agent. The result shows:
root@mgt01:~# openstack server list
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
| ID                                   | Name     | Status | Networks             | Image               | Flavor  |
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
......
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms  <-------- seq=62 missing
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms
As one can see, packet seq 62 is lost; I believe this happened while
the ovs agent was restarting.
Right now, I suspect this code:
https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
It refreshes the flow table rules even though that is not necessary.
When I dump flows on the phys bridge, I can see the duration
rewinding to 0, which suggests the flow has been deleted and created
again:
""" duration=secs
The time, in seconds, that the entry has been in the table.
secs includes as much precision as the switch provides, possibly
to nanosecond resolution.
"""
root@compute01:~# ovs-ofctl dump-flows br-floating
...
cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409,
                           ^------ this value resets
priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...
IMO, rebooting the ovs-agent should not affect the data plane.
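For what it's worth, the duration reset is consistent with
cookie-based flow replacement: on restart the agent stamps re-added
flows with a fresh cookie and then deletes anything still carrying an
old cookie, so even logically unchanged rules are torn down and
recreated. A minimal sketch of that pattern (invented names, not the
actual Neutron code):

from dataclasses import dataclass, field
from typing import List


@dataclass
class Flow:
    cookie: int
    rule: str


@dataclass
class Bridge:
    flows: List[Flow] = field(default_factory=list)

    def dump_flows(self) -> List[Flow]:
        return list(self.flows)

    def delete_flows(self, cookie: int) -> None:
        self.flows = [f for f in self.flows if f.cookie != cookie]


def cleanup_stale_flows(bridge: Bridge, current_cookie: int) -> None:
    """Drop every flow stamped by a previous agent run. Rules that
    are unchanged still get deleted and re-added under the new
    cookie, which is why their 'duration' rewinds to 0."""
    stale = {f.cookie for f in bridge.dump_flows()} - {current_cookie}
    for cookie in stale:
        bridge.delete_flows(cookie)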
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions