[Bug 2015533] Re: Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Mauricio Faria de Oliveira
2015533 at bugs.launchpad.net
Fri May 12 22:18:43 UTC 2023
The traffic interruption on dpdkvhostuserclient ports is probably
due to an issue in the package upgrade of `openvswitch-switch-dpdk`,
combined with the subsequent OVS restart triggered by `openvswitch-switch`.
1) When `openvswitch-switch-dpdk` is upgraded, the old version
is affected by bug 1836713 [1] and resets OVS back to the non-DPDK binary:
@ sosreport-brtlvmrs0613co-00358062-2023-04-08-ltxuscc:var/log/apt/term.log
Preparing to unpack .../190-openvswitch-switch-dpdk_2.9.8-0ubuntu0.18.04.4_amd64.deb ...
update-alternatives: removing manually selected alternative - switching ovs-vswitchd to auto mode
update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in auto mode
Unpacking openvswitch-switch-dpdk (2.9.8-0ubuntu0.18.04.4) over (2.9.5-0ubuntu0.18.04.1) ...
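This reset can be confirmed and undone via update-alternatives; a
minimal sketch, assuming the DPDK alternative path registered by
openvswitch-switch-dpdk on bionic (take the exact path from the
--display output):

  # Show which ovs-vswitchd binary the alternatives system points at.
  update-alternatives --display ovs-vswitchd

  # Re-select the DPDK build explicitly (path is an assumption; verify
  # it against the --display output above).
  update-alternatives --set ovs-vswitchd \
      /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk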
2) When `openvswitch-switch` is upgraded, it restarts OVS:
@ openvswitch-switch.postinst:
# summary of how this script can be called:
#   * <postinst> `configure' <most-recently-configured-version>
...
if [ "$1" = "configure" ] || ... ; then
    ...
    if [ -n "$2" ]; then
        _dh_action=restart
    ...
    invoke-rc.d openvswitch-switch $_dh_action || exit 1
    ...
fi
Apr 06 07:18:14 brtlvmrs0613co ovs-ctl[49562]: * Exiting ovs-vswitchd (3657)
...
Apr 06 07:18:15 brtlvmrs0613co ovs-vswitchd[49757]: ovs|00007|dpdk|ERR|DPDK not supported in this copy of Open ...
...
Apr 06 07:18:16 brtlvmrs0613co ovs-ctl[49717]: * Starting ovs-vswitchd
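To tell which build actually came back up after such a restart, a
quick check (assuming the dpdk_initialized column exposed by these
2.9 packages):

  # Resolve the alternatives symlink to the running binary.
  readlink -f /usr/sbin/ovs-vswitchd

  # Reports 'true' only when a DPDK-enabled build initialized DPDK.
  ovs-vsctl get Open_vSwitch . dpdk_initialized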
3) When OVS (non-DPDK) restarts, the dpdkvhostuserclient ports
cannot be added back:
2023-04-06T07:18:15.683Z|00029|netdev|WARN|could not create netdev vhu698a70de-9a of unknown type dpdkvhostuserclient
...
2023-04-06T07:18:15.683Z|00031|netdev|WARN|could not create netdev vhu9471d4d7-5b of unknown type dpdkvhostuserclient
...
2023-04-06T07:18:15.683Z|00033|netdev|WARN|could not create netdev vhu6caa02dd-b2 of unknown type dpdkvhostuserclient
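For reference, interfaces that failed to instantiate like this can be
listed straight from the database, since the failure is recorded in
the Interface 'error' column:

  ovs-vsctl --columns=name,error find Interface type=dpdkvhostuserclient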
4) Now the VMs have vhost-user ports in a non-functional state,
waiting for OVS DPDK (which is not running) to bring the ports up in
vhost-user client mode and connect to the vhost-user server in QEMU.
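For context, a sketch of how such a port is typically defined (bridge
name and socket path here are illustrative): OVS acts as the
vhost-user client, so the socket path must match the server socket
QEMU creates for the VM:

  ovs-vsctl add-port br-int vhu698a70de-9a -- \
      set Interface vhu698a70de-9a type=dpdkvhostuserclient \
      options:vhost-server-path=/run/openvswitch/vhu698a70de-9a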
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/2015533
Title:
Loss of network connectivity after upgrading dpdk packages from
17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Status in dpdk package in Ubuntu:
Incomplete
Status in openvswitch package in Ubuntu:
Invalid
Bug description:
We upgraded the following packages on a number of hosts running on bionic-queens:
* dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
* openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4
It was just a plain `apt dist-upgrade`, which also upgraded a number
of other packages; I can provide a full list of upgraded packages if
needed.
This resulted in a complete dataplane outage on a production cloud.
Symptoms:
1. Loss of network connectivity on virtual machines using
dpdkvhostuser ports.
VMs were unable to send any packets. Using `virsh console` we observed
the following line printed a few times per second:
net eth0: unexpected txq (0) queue failure: -5
At the same time we also observed the following messages in OVS logs:
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed
rx_* counters on the vhu port in OVS (rx from OVS = tx from the VM's
point of view) were not increasing.
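One way to watch this, for reference (port name illustrative):

  watch -n1 'ovs-vsctl get Interface vhu698a70de-9a statistics:rx_packets'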
2. Segmentation faults in ovs/dpdk libraries.
This was another symptom. After restarting OVS, it would run fine for a while but crash after approx. 5-60 minutes on upgraded hosts.
There were no logs from OVS itself that showed the crash; the only output was always a single line in dmesg. Examples:
[22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
[22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]
Or on another host:
[22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
[22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
[22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
[23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
[23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
[23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
[ 639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
[ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
[ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
[ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]
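For anyone decoding these traces: the faulting offset inside the
library is the ip minus the mapping base printed by dmesg, e.g. for
the first line 0x7f3b570ad7a5 - 0x7f3b57094000 = 0x197a5, which can
be resolved to a symbol once the matching dbgsym package is installed:

  # Library path as installed on bionic; adjust if different.
  addr2line -f -e /usr/lib/x86_64-linux-gnu/librte_eal.so.17.11 0x197a5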
This was somewhat "stabilized" by a full restart of OVS and the
neutron agents and by not touching any VMs, but on one machine we
still saw librte_vhost.so segfaults. Even without segfaults, we still
faced the "net eth0: unexpected txq (0) queue failure: -5" issue and
did not have working connectivity.
The issue was also easy to trigger by attempting a live migration of a
VM that was using a vhu port, although OVS was also crashing randomly
on its own.
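A hypothetical reproducer along those lines (domain name and
destination host are placeholders):

  virsh migrate --live instance-000003e8 qemu+ssh://other-host/system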
Failed attempts to restore the dataplane included:
1. Restart of ovs and neutron agents.
2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
3. Reboot of the hosts.
4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.
Solution:
After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and
17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by
manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_*
debs (63 packages in total). Full list of rolled back packages: [1]
Please note that we re-installed the latest available OVS version
(both openvswitch-switch and openvswitch-switch-dpdk,
2.9.8-0ubuntu0.18.04.4) before rolling back dpdk.
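A minimal sketch of such a rollback, assuming the 0ubuntu0.1 debs are
still available from the archive or a local mirror (the package
patterns are an assumption; see [1] for the exact list):

  # Downgrade every installed dpdk/librte package in one transaction,
  # then hold them so a later dist-upgrade does not pull them back up.
  pkgs=$(dpkg-query -W -f '${Package}\n' 'librte*' 'dpdk*')
  apt-get install --allow-downgrades \
      $(echo "$pkgs" | sed 's/$/=17.11.10-0ubuntu0.1/')
  apt-mark hold $pkgs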
Actions taken after the downgrade:
1. Stopped all VMs.
2. Restarted OVS.
3. Restarted neutron agents.
4. Started all VMs.
The rollback of the 63 dpdk/librte_* packages plus the service
restarts were the only actions needed to restore connectivity on all
machines. The error messages disappeared from the VMs' console logs
(no more "net eth0: unexpected txq (0) queue failure: -5"), OVS
started reporting rising rx_* counters on the vhu ports, and the
segmentation faults from ovs-vswitchd and the pmd threads stopped as
well.
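For completeness, the kind of sanity checks used to verify this
(package patterns illustrative, as above):

  # All dpdk/librte packages back at the 0ubuntu0.1 version:
  dpkg -l 'librte*' 'dpdk*' | awk '/^ii/ {print $2, $3}'

  # No new librte segfaults accumulating:
  dmesg -T | grep -E 'segfault .* in librte' || echo 'no segfaults logged'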
[0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
[1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/2015533/+subscriptions