[Bug 2015533] Re: Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Mauricio Faria de Oliveira
2015533 at bugs.launchpad.net
Fri May 12 22:20:46 UTC 2023
For OVS package upgrades from versions earlier than
`2.9.8-0ubuntu0.18.04.1` (the upgrade in this bug started from
`2.9.5-0ubuntu0.18.04.1`; note .5 vs .8),
please manually remove update-alternatives from the prerm script,
as documented in bug 1836713
[this broke other DPDK clouds back then, see comment #3],
per comment #4/description:
$ sudo sed -i "/update-alternatives/d" /var/lib/dpkg/info/openvswitch-switch-dpdk.prerm
and then upgrade openvswitch-switch-dpdk (or run `apt upgrade` or `apt dist-upgrade`).
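A slightly more defensive variant of the sed workaround above, as a minimal sketch: it keeps a backup copy of the script and does nothing when no update-alternatives line is present. The function name `strip_update_alternatives` and the `.bak` suffix are my own choices, not part of any package, and `sed -i` assumes GNU sed as shipped on Ubuntu.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): remove update-alternatives
# lines from a maintainer script, keeping a backup of the original.
strip_update_alternatives() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "not found: $script" >&2
        return 1
    fi
    if grep -q 'update-alternatives' "$script"; then
        cp -p "$script" "$script.bak"              # keep the original for reference
        sed -i '/update-alternatives/d' "$script"  # same expression as the workaround
        echo "patched: $script"
    else
        echo "nothing to do: $script"
    fi
}

# Usage on an affected host, as root, before the upgrade:
#   strip_update_alternatives /var/lib/dpkg/info/openvswitch-switch-dpdk.prerm
```

Whether a host needs this at all can be checked with `dpkg --compare-versions <installed-version> lt 2.9.8-0ubuntu0.18.04.1`, which exits 0 when the installed version is older.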
--
https://bugs.launchpad.net/bugs/2015533
Title:
Loss of network connectivity after upgrading dpdk packages from
17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Status in dpdk package in Ubuntu:
Incomplete
Status in openvswitch package in Ubuntu:
Invalid
Bug description:
We upgraded the following packages on a number of hosts running on bionic-queens:
* dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
* openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4
It was just a plain `apt dist-upgrade`, which also upgraded a number of
other packages - I can provide a full list of upgraded packages if
needed.
This resulted in a complete dataplane outage on a production cloud.
Symptoms:
1. Loss of network connectivity on virtual machines using
dpdkvhostuser ports.
VMs were unable to send any packets. Using `virsh console` we observed
the following line printed a few times per second:
net eth0: unexpected txq (0) queue failure: -5
At the same time we also observed the following messages in OVS logs:
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed
rx_* counters on the vhu port in OVS (rx from ovs = tx from VM's point
of view) were not increasing.
2. Segmentation faults in ovs/dpdk libraries.
This was another symptom. After a restart, OVS would run fine for a while but crash again after approx. 5-60 minutes on upgraded hosts.
OVS itself logged nothing about the crashes; the only output was always a single line in dmesg. Examples:
[22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
[22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]
Or on another host:
[22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
[22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
[22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
[23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
[23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
[23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
[ 639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
[ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
[ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
[ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]
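To see at a glance which library the crashes land in, lines like the ones above can be tallied per shared object. This is only a sketch: the function name `count_segfaults_by_lib` is made up here, and it assumes the kernel's usual "segfault at ... in <library>[base+size]" message format shown in the dmesg output above.

```shell
#!/bin/sh
# Hypothetical helper: count kernel segfault messages per library.
# Reads dmesg-style lines on stdin, prints "<count> <library>" sorted by count.
count_segfaults_by_lib() {
    grep 'segfault at' \
        | sed -n 's/.* in \([^[]*\)\[.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage:
#   dmesg | count_segfaults_by_lib
```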
This was somewhat "stabilized" by a full restart of OVS and the neutron
agents and by not touching any VMs, but on one machine we still saw
librte_vhost.so segfaults. Even without segfaults, we still hit
the "net eth0: unexpected txq (0) queue failure: -5" error and
had no working connectivity.
The issue was also easy to trigger by attempting a live migration of a
VM that was using a vhu port although it was also crashing randomly on
its own.
Failed attempts to restore the dataplane included:
1. Restart of ovs and neutron agents.
2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
3. Reboot of the hosts.
4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.
Solution:
After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and
17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by
manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_*
debs (63 packages in total). Full list of rolled back packages: [1]
Please note that we also re-installed the latest available OVS (both
openvswitch-switch and openvswitch-switch-dpdk) version before rolling
back dpdk: 2.9.8-0ubuntu0.18.04.4.
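With 63 packages involved, it is easy to miss one during a manual rollback. A minimal sketch of a check that every dpdk/librte package ended up at the intended version follows; the function name `list_version_mismatches` is hypothetical, and the glob patterns in the usage comment are an assumption about how the binary packages are named on the host.

```shell
#!/bin/sh
# Hypothetical helper: read "<package> <version>" lines on stdin and print
# every package whose version differs from the expected one.
# Exit status is 0 when all versions match, 1 otherwise.
list_version_mismatches() {
    expected="$1"
    awk -v want="$expected" '
        $2 != want { print $1, $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# Usage (patterns are an assumption; adjust to the package list in [1]):
#   dpkg-query -W -f='${Package} ${Version}\n' 'dpdk*' 'librte*' \
#       | list_version_mismatches 17.11.10-0ubuntu0.1
```

Packages that dpkg knows about but that are not installed show up with an empty version and are therefore listed as mismatches too.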
Actions taken after the downgrade:
1. Stopped all VMs.
2. Restarted OVS.
3. Restarted neutron agents.
4. Started all VMs.
Rollback of 63 dpdk/librte_* packages and service restarts were the
only actions that we needed to restore the connectivity on all
machines. Error messages disappeared from VMs' console log (no more
"net eth0: unexpected txq (0) queue failure: -5"). OVS started to
report rx_* counters rising on vhu ports. Segmentation faults from ovs
and pmd stopped as well.
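The rx_* counter check mentioned above can be scripted roughly as below. This is a sketch under assumptions: it samples `rx_packets` twice via `ovs-vsctl get Interface <port> statistics:rx_packets` and expects a plain integer back. `OVS_VSCTL`, `rx_packets` and `rx_is_increasing` are names invented for this example, and the port name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical override hook so the command can be swapped out when testing.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

# Print the current rx_packets counter of an OVS interface.
rx_packets() {
    "$OVS_VSCTL" get Interface "$1" statistics:rx_packets
}

# Sample the counter twice, <interval> seconds apart (default 5).
# Returns 0 if it grew, 1 if it did not, 2 if the counter could not be read.
rx_is_increasing() {
    port="$1"
    interval="${2:-5}"
    before=$(rx_packets "$port") || return 2
    sleep "$interval"
    after=$(rx_packets "$port") || return 2
    echo "$port rx_packets: $before -> $after"
    [ "$after" -gt "$before" ]
}

# Usage (port name is a placeholder):
#   rx_is_increasing vhuXXXXXXXX 10 || echo "no traffic received from the VM"
```

A flat counter only means something if the VM is actually trying to send, so generate traffic from the guest while sampling.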
[0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
[1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/