[Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Trent Lloyd
1869808 at bugs.launchpad.net
Tue Mar 16 02:00:34 UTC 2021
Looking to get this approved so that we can verify it, as we ideally
need this released by the weekend of March 27th for some maintenance
activity. Is something holding back the approval?
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1869808
Title:
reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive queens series:
Triaged
Status in Ubuntu Cloud Archive rocky series:
Fix Committed
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Released
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Bionic:
In Progress
Status in neutron source package in Focal:
Fix Released
Status in neutron source package in Groovy:
Fix Released
Status in neutron source package in Hirsute:
Fix Released
Bug description:
(SRU template copied from comment 42)
[Impact]
- When there is a RabbitMQ or neutron-api outage, the neutron-
openvswitch-agent undergoes a "resync" process and temporarily blocks
all VM traffic. This always happens for a short period (maybe <1
second), but in some high-scale environments it lasts for minutes. If
RabbitMQ is down again during the resync, traffic is also blocked
until the agent can reconnect, which may take a long time. This also
affects situations where neutron-openvswitch-agent is intentionally
restarted while RabbitMQ is down (see the sketch after this list).
Bug #1869808 addresses this issue and Bug #1887148 is a fix for that
fix to prevent network loops during DVR startup.
- In the same situation, the neutron-l3-agent can delete the L3 router
(Bug #1871850)
[Test Case]
(1) Deploy OpenStack Bionic-Queens with DVR and a *VLAN* tenant
network (VXLAN or FLAT will not reproduce the issue). With a standard
deployment, simply enabling DHCP on the ext_net subnet will allow VMs
to be booted directly on the ext_net provider network: run "openstack
subnet set --dhcp ext_net" and then deploy the VM directly to
ext_net.
(2) Deploy a VM to the VLAN network
(3) Start pinging the VM from an external network (a measurement
helper sketch follows these steps)
(4) Stop all RabbitMQ servers
(5) Restart neutron-openvswitch-agent
(6) Ping traffic should cease and not recover
(7) Start all RabbitMQ servers
(8) Ping traffic will recover after 30-60 seconds
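To make the outage in step (6) and the recovery in step (8) easier to
measure, a rough Python helper like the following can stand in for a
bare ping (the address is the example VM IP from this report):

#!/usr/bin/env python3
"""Ping the VM once a second and report how long any outage lasts."""
import subprocess
import time

VM_IP = "172.31.10.4"  # example address from this report


def ping_once(ip, timeout=1):
    # One ICMP echo; returncode 0 means a reply arrived in time.
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0


outage_started = None
while True:
    if not ping_once(VM_IP):
        if outage_started is None:
            outage_started = time.time()
            print("outage started")
    elif outage_started is not None:
        print("outage lasted %.1fs" % (time.time() - outage_started))
        outage_started = None
    time.sleep(1)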
[Where problems could occur]
These patches are all cherry-picked from the upstream stable
branches and have existed upstream, including on the stable/queens
branch, for many months. In Ubuntu, all supported subsequent releases
(Stein onwards) have also carried these patches for many months, with
the exception of Queens.
There is a chance that not installing these drop flows during
startup could let traffic go somewhere unexpected while the network
is in a partially set-up state. This was the case for DVR: in setups
with more than one DVR external network port, a network loop could
temporarily be created. That was already addressed with the included
patch for Bug #1887148. We checked and could not locate any other
merged changes to this drop_port logic that would also need to be
backported.
[Other Info]
[original description]
We are using OpenStack Neutron 13.0.6, deployed using
OpenStack-Helm.
I tested pinging servers in the same VLAN while restarting the
neutron-ovs-agent. The result shows:
root@mgt01:~# openstack server list
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
| ID                                   | Name     | Status | Networks             | Image               | Flavor  |
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |
+--------------------------------------+----------+--------+----------------------+---------------------+---------+
$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
......
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms  <-------- seq=62 missing
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms
As one can see, packet seq 62 is lost; I believe this happened while
the ovs agent was restarting.
Right now, I suspect this code:
https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
It refreshes the flow table rules even though that is not necessary.
When I dump flows on the phys bridge, I can see the duration
rewinding to 0, which suggests the flow has been deleted and created
again:
""" duration=secs
The time, in seconds, that the entry has been in the table.
secs includes as much precision as the switch provides, possibly
to nanosecond resolution.
"""
root@compute01:~# ovs-ofctl dump-flows br-floating
...
cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409,
                           ^------ this value resets
priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...
IMO, rebooting the ovs-agent should not affect the data plane.
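For what it's worth, the duration reset is consistent with
cookie-based flow replacement: on restart the agent stamps re-added
flows with a fresh cookie and then deletes anything still carrying an
old cookie, so even logically unchanged rules are torn down and
recreated. A minimal sketch of that pattern (invented names, not the
actual Neutron code):

from dataclasses import dataclass, field
from typing import List


@dataclass
class Flow:
    cookie: int
    rule: str


@dataclass
class Bridge:
    flows: List[Flow] = field(default_factory=list)

    def dump_flows(self) -> List[Flow]:
        return list(self.flows)

    def delete_flows(self, cookie: int) -> None:
        self.flows = [f for f in self.flows if f.cookie != cookie]


def cleanup_stale_flows(bridge: Bridge, current_cookie: int) -> None:
    """Drop every flow stamped by a previous agent run. Rules that
    are unchanged still get deleted and re-added under the new
    cookie, which is why their 'duration' rewinds to 0."""
    stale = {f.cookie for f in bridge.dump_flows()} - {current_cookie}
    for cookie in stale:
        bridge.delete_flows(cookie)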
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions