[Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Dan Streetman
1869808 at bugs.launchpad.net
Thu Feb 18 17:01:22 UTC 2021
** Description changed:
+ (SRU template copied from comment 42)
+
+ [Impact]
+
+ - When there is a RabbitMQ or neutron-api outage, the neutron-
+ openvswitch-agent undergoes a "resync" process and temporarily blocks
+ all VM traffic. This always happens for a short period (maybe less than
+ 1 second), but in some high-scale environments it lasts for minutes. If
+ RabbitMQ is down again during the resync, traffic is also blocked until
+ the agent can reconnect, which may take a long time. This also affects
+ situations where neutron-openvswitch-agent is intentionally restarted
+ while RabbitMQ is down. Bug #1869808 addresses this issue, and Bug
+ #1887148 is a follow-up fix that prevents network loops during DVR
+ startup.
+
+ - In the same situation, the neutron-l3-agent can delete the L3 router
+ (Bug #1871850)
+
+ [Test Case]
+
+ (1) Deploy OpenStack Bionic-Queens with DVR and a *VLAN* tenant network
+ (VXLAN or FLAT will not reproduce the issue). With a standard
+ deployment, simply enabling DHCP on the ext_net subnet ("openstack
+ subnet set --dhcp ext_net") allows VMs to be booted directly on the
+ ext_net provider network; then deploy the VM directly to ext_net. (A
+ consolidated command sketch follows step 8.)
+
+ (2) Deploy a VM to the VLAN network
+
+ (3) Start pinging the VM from an external network
+
+ (4) Stop all RabbitMQ servers
+
+ (5) Restart neutron-openvswitch-agent
+
+ (6) Ping traffic should cease and not recover
+
+ (7) Start all RabbitMQ servers
+
+ (8) Ping traffic will recover after 30-60 seconds
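
For convenience, here is a minimal shell sketch of steps (1)-(8),
assuming RabbitMQ and the agent are managed by systemd on their
respective hosts; the image, flavor, server name and VM address used
here are illustrative and deployment-specific:

  # (1) allow VMs to boot directly on the ext_net provider (VLAN) network
  #     (subnet name as quoted in step (1); use your ext_net subnet's name)
  openstack subnet set --dhcp ext_net

  # (2) deploy a VM to the VLAN network
  openstack server create --image cirros --flavor m1.tiny \
      --network ext_net test-vlan-vm

  # (3) from an external host, ping the VM continuously
  ping <vm-address>

  # (4) on each RabbitMQ server
  sudo systemctl stop rabbitmq-server

  # (5) on the compute host
  sudo systemctl restart neutron-openvswitch-agent

  # (6) without the fix, the ping now stops and does not recover

  # (7) on each RabbitMQ server
  sudo systemctl start rabbitmq-server

  # (8) the ping recovers roughly 30-60 seconds later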
+
+ [Where problems could occur]
+
+ These patches are all cherry-picked from the upstream stable branches
+ and have existed upstream, including on the stable/queens branch, for
+ many months. In Ubuntu, all supported subsequent releases (Stein
+ onwards) have also carried these patches for many months, with the
+ exception of Queens.
+
+ There is a chance that not installing these drop flows during startup
+ could let traffic go somewhere unexpected while the network is only
+ partially set up. This was the case for DVR: in setups where more than
+ one DVR external network port existed, a network loop could temporarily
+ be created. That was already addressed with the included patch for Bug
+ #1869808. I checked and could not locate any other merged changes to
+ this drop_port logic that also need to be backported.
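
A hedged way to see whether these startup drop flows are in place is to
dump the flows on the bridges and look for drop actions; br-floating is
the physical bridge name from the report below, and br-int is assumed
to be the default integration bridge name:

  # temporary drop rules installed while the agent (re)builds its state
  sudo ovs-ofctl dump-flows br-floating | grep -i drop
  sudo ovs-ofctl dump-flows br-int | grep -i drop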
+
+ [Other Info]
+
+
+ [original description]
+
We are using OpenStack Neutron 13.0.6, deployed using OpenStack-Helm.
I tested pinging servers in the same VLAN while restarting neutron-ovs-agent.
The result shows:
root@mgt01:~# openstack server list
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
| ID                                   | Name     | Status | Networks             | Image               | Flavor  |
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
......
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <--------
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms
As one can see, packet seq 62 is lost, I believe while the OVS agent
was restarting.
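
A hedged way to pin down exactly when the gap happens is to let ping
print timestamps and report missing replies (the -D and -O options are
from iputils ping on Linux; adjust if your ping differs):

  # -D prefixes each line with a UNIX timestamp, -O reports a missing
  # reply before the next request is sent
  ping -D -O 172.31.10.4
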
Right now, I suspect this code:
https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
It refreshes the flow table rules even though that is not necessary.
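
As I understand the native OpenFlow driver, the agent stamps the flows
it installs with a per-run cookie and, while resyncing, deletes flows
whose cookie it no longer recognizes, so even unchanged rules get
removed and re-added. A hedged illustration of that kind of
cookie-based deletion with plain ovs-ofctl (the agent does this through
its own OpenFlow library, and the cookie value is taken from the dump
below; do not run the delete on a production bridge):

  # each flow carries the agent's cookie for the current run
  sudo ovs-ofctl dump-flows br-floating | grep cookie=

  # deleting by exact cookie match (mask -1) is the sort of cleanup that
  # makes a rule's duration counter restart from 0 -- illustration only
  sudo ovs-ofctl del-flows br-floating "cookie=0x673522f560f5ca4f/-1"
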
When I dump the flows on the physical bridge, I can see the duration
rewinding to 0, which suggests the flow has been deleted and created
again:
""" duration=secs
- The time, in seconds, that the entry has been in the table.
- secs includes as much precision as the switch provides, possibly
- to nanosecond resolution.
+ The time, in seconds, that the entry has been in the table.
+ secs includes as much precision as the switch provides, possibly
+ to nanosecond resolution.
"""
root@compute01:~# ovs-ofctl dump-flows br-floating
...
cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409,
                           ^------ this value resets
priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...
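
A simple hedged way to watch for this while restarting the agent
(bridge and table number taken from the dump above):

  # re-dump table 2 every second; a duration that falls back toward 0
  # means the flow was deleted and re-added rather than left in place
  watch -n 1 "sudo ovs-ofctl dump-flows br-floating table=2"
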
IMO, restarting the ovs-agent should not affect the data plane.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1869808
Title:
reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive queens series:
New
Status in Ubuntu Cloud Archive rocky series:
New
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Released
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Bionic:
New
Status in neutron source package in Focal:
Fix Released
Status in neutron source package in Groovy:
Fix Released
Status in neutron source package in Hirsute:
Fix Released
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions