[Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Edward Hope-Morley
1869808@bugs.launchpad.net
Wed Mar 24 17:31:58 UTC 2021
Verified Bionic queens using [Test Plan] with output as follows:
# apt-cache policy neutron-common
neutron-common:
  Installed: 2:12.1.1-0ubuntu4
  Candidate: 2:12.1.1-0ubuntu4
  Version table:
 *** 2:12.1.1-0ubuntu4 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2:12.1.1-0ubuntu3 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     2:12.0.1-0ubuntu1 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
I kept a ping running while restarting both RabbitMQ and neutron-openvswitch-agent and did not see any interruption.
** Description changed:
(SRU template copied from comment 42)
[Impact]
- When there is a RabbitMQ or neutron-api outage, the
neutron-openvswitch-agent undergoes a "resync" process and temporarily
blocks all VM traffic. This always happens for a short time (maybe <1
second), but in some high-scale environments it lasts for minutes. If
RabbitMQ goes down again during the resync, traffic stays blocked until
the agent can reconnect, which may take a long time. This also affects
situations where neutron-openvswitch-agent is intentionally restarted
while RabbitMQ is down. Bug #1869808 addresses this issue, and Bug
#1887148 is a follow-up to that fix which prevents network loops during
DVR startup.
- In the same situation, the neutron-l3-agent can delete the L3 router
(Bug #1871850), or may need to refresh the tunnel (Bug #1853613), or may
need to update flows or reconfigure bridges (Bug #1864822).
[Test Case]
(1) Deploy OpenStack Bionic-Queens with DVR and a *VLAN* tenant network
(VXLAN or FLAT will not reproduce the issue). With a standard
deployment, simply enabling DHCP on the ext_net subnet ("openstack
subnet set --dhcp ext_net") allows VMs to be booted directly on the
ext_net provider network; then deploy the VM directly to ext_net.
(2) Deploy a VM to the VLAN network
(3) Start pinging the VM from an external network
(4) Stop all RabbitMQ servers
(5) Restart neutron-openvswitch-agent
- (6) Ping traffic should cease and not recover
+ (6) Ping traffic should NOT see interruption
(7) Start all RabbitMQ servers
- (8) Ping traffic will recover after 30-60 seconds
+ (8) Ping traffic should still be fine
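The steps above could be scripted roughly as follows. This is only a sketch under assumptions that are not part of the test plan: it supposes RabbitMQ and the agent are systemd units (rabbitmq-server, neutron-openvswitch-agent) controllable from the machine running the script, that this machine can ping the VM, and that the VM address is the placeholder below. The function names are hypothetical.

```python
import re
import shlex
import subprocess

# Placeholder values for illustration; adjust to the deployment under test.
VM_IP = "172.31.10.4"
STEPS = [
    "systemctl stop rabbitmq-server",               # step (4)
    "systemctl restart neutron-openvswitch-agent",  # step (5)
    "systemctl start rabbitmq-server",              # step (7)
]

# Matches both the iputils and the busybox form of ping's summary line.
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def lost_packets(ping_output):
    """Return transmitted minus received, read from ping's summary line."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = (int(group) for group in match.groups())
    return sent - received


def planned_commands(count):
    """Return the argument lists that run_test_plan() would execute."""
    return [["ping", "-c", str(count), VM_IP]] + [shlex.split(s) for s in STEPS]


def run_test_plan(count=120):
    """Ping the VM (step 3) while running steps (4), (5) and (7)."""
    ping_cmd, *step_cmds = planned_commands(count)
    ping = subprocess.Popen(ping_cmd, stdout=subprocess.PIPE, text=True)
    for cmd in step_cmds:
        subprocess.run(cmd, check=True)
    output, _ = ping.communicate()
    # 0 lost packets corresponds to steps (6) and (8) passing.
    return lost_packets(output)
```

Calling planned_commands() first shows what would be run without touching any service. In a real deployment the RabbitMQ servers usually live on other hosts, so the stop and start steps would have to be issued there instead.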
[Where problems could occur]
These patches are all cherry-picked from the upstream stable branches,
where they have existed for many months, including in stable/queens. In
Ubuntu, all supported subsequent releases (Stein onwards) have also
carried these patches for many months; Queens is the exception.
There is a chance that not installing these drop flows during startup
could send traffic somewhere unexpected while the network is only
partially set up. This was the case for DVR: in setups where more than
one DVR external network port existed, a network loop could be
temporarily created. This was already addressed with the included patch
for Bug #1869808. I checked and could not locate any other merged
changes to this drop_port logic that also need to be backported.
[Other Info]
[original description]
We are using OpenStack Neutron 13.0.6, deployed using OpenStack-Helm.
I pinged servers in the same VLAN while restarting neutron-ovs-agent.
The result shows:
root@mgt01:~# openstack server list
+--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
| ID                                   | Name            | Status | Networks                                 | Image                        | Flavor    |
+--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1        | ACTIVE | vlan105=172.31.10.4                      | Cirros 0.4.0 64-bit          | m1.tiny   |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2        | ACTIVE | vlan105=172.31.10.18                     | Cirros 0.4.0 64-bit          | m1.tiny   |
+--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
......
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <--------
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms
As one can see, the packet with seq 62 is lost, I believe while the OVS
agent was restarting.
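Spotting such a gap by eye gets tedious on longer runs. A minimal sketch of an automated check, assuming the ping output has been saved as text and that sequence numbers do not wrap around during the run (the function name is hypothetical):

```python
import re

# Matches "seq=61" (busybox/Cirros) as well as "icmp_seq=61" (iputils).
SEQ_RE = re.compile(r"seq=(\d+)")


def missing_sequences(ping_output):
    """Return the sequence numbers absent between the first and last reply."""
    seen = sorted({int(m.group(1)) for m in SEQ_RE.finditer(ping_output)})
    if not seen:
        return []
    return sorted(set(range(seen[0], seen[-1] + 1)) - set(seen))


sample = """\
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
"""
print(missing_sequences(sample))  # [62]
```

Replies lost before the first or after the last line received are not detected this way; comparing against ping's own summary line covers that case.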
Right now, I suspect that the code at
https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
is refreshing the flow table rules even though it is not necessary,
because when I dump the flows on the physical bridge I can see the
duration rewind to 0, which suggests the flows have been deleted and
created again.
""" duration=secs
The time, in seconds, that the entry has been in the table.
secs includes as much precision as the switch provides, possibly
to nanosecond resolution.
"""
root@compute01:~# ovs-ofctl dump-flows br-floating
...
cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409,
                                    ^------ this value resets
priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...
IMO, restarting the ovs-agent should not affect the data plane.
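The duration observation can be checked mechanically by taking one "ovs-ofctl dump-flows" snapshot before the restart and one after, then comparing them. The sketch below assumes each flow is printed on a single line, as ovs-ofctl normally does, and identifies a flow by its table, match and actions. The cookie is ignored on the assumption that a restarted agent may reinstall the same match under a new cookie. The function names are hypothetical.

```python
import re

DURATION_RE = re.compile(r"duration=([\d.]+)s")
# Fields that change over time (or across agent restarts) and so cannot
# be used to identify a flow.
VOLATILE_RE = re.compile(
    r"(?:cookie|duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)


def flow_durations(dump):
    """Map each flow (volatile fields stripped) to its duration in seconds."""
    durations = {}
    for line in dump.splitlines():
        match = DURATION_RE.search(line)
        if match:
            key = VOLATILE_RE.sub("", line).strip()
            durations[key] = float(match.group(1))
    return durations


def reinstalled_flows(before, after):
    """Return flows whose duration went backwards between two dumps."""
    old, new = flow_durations(before), flow_durations(after)
    return [flow for flow, secs in new.items() if flow in old and secs < old[flow]]
```

A flow that was left untouched keeps counting up, so its duration in the second dump is larger; any flow returned here was deleted and re-added in between.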
** Description changed:
- [Test Case]
+ [Test Plan]
** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic
https://bugs.launchpad.net/bugs/1869808
Title:
reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive queens series:
Fix Committed
Status in Ubuntu Cloud Archive rocky series:
Fix Committed
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Released
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Bionic:
Fix Committed
Status in neutron source package in Focal:
Fix Released
Status in neutron source package in Groovy:
Fix Released
Status in neutron source package in Hirsute:
Fix Released