[Bug 2015533] Re: Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Mauricio Faria de Oliveira
2015533 at bugs.launchpad.net
Fri May 12 22:18:43 UTC 2023
The traffic interruption on dpdkvhostuserclient ports is probably
due to an issue in the package upgrade of `openvswitch-switch-dpdk`,
combined with the subsequent OVS restart triggered by `openvswitch-switch`.
1) When `openvswitch-switch-dpdk` is upgraded, the old version
is affected by bug 1836713 [1] and resets OVS back to the non-DPDK binary:
@ sosreport-brtlvmrs0613co-00358062-2023-04-08-ltxuscc:var/log/apt/term.log
Preparing to unpack .../190-openvswitch-switch-dpdk_2.9.8-0ubuntu0.18.04.4_amd64.deb ...
update-alternatives: removing manually selected alternative - switching ovs-vswitchd to auto mode
update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in auto mode
Unpacking openvswitch-switch-dpdk (2.9.8-0ubuntu0.18.04.4) over (2.9.5-0ubuntu0.18.04.1) ...
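This reset can be confirmed and undone via update-alternatives; a
minimal sketch, assuming the DPDK alternative path registered by
openvswitch-switch-dpdk on bionic (take the exact path from the
--display output):

  # Show which ovs-vswitchd binary the alternatives system points at.
  update-alternatives --display ovs-vswitchd

  # Re-select the DPDK build explicitly (path is an assumption; verify
  # it against the --display output above).
  update-alternatives --set ovs-vswitchd \
      /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk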
2) When `openvswitch-switch` is upgraded, it restarts OVS:
@ openvswitch-switch.postinst:
# summary of how this script can be called:
#   * <postinst> `configure' <most-recently-configured-version>
...
if [ "$1" = "configure" ] || ... ; then
    ...
    if [ -n "$2" ]; then
        _dh_action=restart
    ...
    invoke-rc.d openvswitch-switch $_dh_action || exit 1
    ...
fi
Apr 06 07:18:14 brtlvmrs0613co ovs-ctl[49562]: * Exiting ovs-vswitchd (3657)
...
Apr 06 07:18:15 brtlvmrs0613co ovs-vswitchd[49757]: ovs|00007|dpdk|ERR|DPDK not supported in this copy of Open ...
...
Apr 06 07:18:16 brtlvmrs0613co ovs-ctl[49717]: * Starting ovs-vswitchd
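To tell which build actually came back up after such a restart, a
quick check (assuming the dpdk_initialized column exposed by these
2.9 packages):

  # Resolve the alternatives symlink to the running binary.
  readlink -f /usr/sbin/ovs-vswitchd

  # Reports 'true' only when a DPDK-enabled build initialized DPDK.
  ovs-vsctl get Open_vSwitch . dpdk_initialized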
3) When OVS (non-DPDK) restarts, the dpdkvhostuserclient ports
cannot be added back:
2023-04-06T07:18:15.683Z|00029|netdev|WARN|could not create netdev vhu698a70de-9a of unknown type dpdkvhostuserclient
...
2023-04-06T07:18:15.683Z|00031|netdev|WARN|could not create netdev vhu9471d4d7-5b of unknown type dpdkvhostuserclient
...
2023-04-06T07:18:15.683Z|00033|netdev|WARN|could not create netdev vhu6caa02dd-b2 of unknown type dpdkvhostuserclient
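For reference, interfaces that failed to instantiate like this can be
listed straight from the database, since the failure is recorded in
the Interface 'error' column:

  ovs-vsctl --columns=name,error find Interface type=dpdkvhostuserclient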
4) Now the VMs have vhost-user ports in a non-functional state,
waiting for OVS DPDK (which is not running) to bring the ports up in
vhost-user client mode and connect to the vhost-user server in QEMU.
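For context, a sketch of how such a port is typically defined (bridge
name and socket path here are illustrative): OVS acts as the
vhost-user client, so the socket path must match the server socket
QEMU creates for the VM:

  ovs-vsctl add-port br-int vhu698a70de-9a -- \
      set Interface vhu698a70de-9a type=dpdkvhostuserclient \
      options:vhost-server-path=/run/openvswitch/vhu698a70de-9a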
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/2015533
Title:
Loss of network connectivity after upgrading dpdk packages from
17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Status in dpdk package in Ubuntu:
Incomplete
Status in openvswitch package in Ubuntu:
Invalid
Bug description:
We upgraded the following packages on a number of hosts running on bionic-queens:
* dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
* openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4
It was just a plain `apt dist-upgrade`, which also upgraded a number
of other packages; I can provide a full list of upgraded packages if
needed.
This resulted in a complete dataplane outage on a production cloud.
Symptoms:
1. Loss of network connectivity on virtual machines using
dpdkvhostuser ports.
VMs were unable to send any packets. Using `virsh console` we observed
the following line printed a few times per second:
net eth0: unexpected txq (0) queue failure: -5
At the same time we also observed the following messages in OVS logs:
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed
rx_* counters on the vhu port in OVS (rx from OVS = tx from the VM's
point of view) were not increasing.
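One way to watch this, for reference (port name illustrative):

  watch -n1 'ovs-vsctl get Interface vhu698a70de-9a statistics:rx_packets'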
2. Segmentation faults in ovs/dpdk libraries.
This was another symptom. After restarting OVS, it would run fine for a while but crash after approx. 5-60 minutes on upgraded hosts.
There were no logs from OVS itself that showed the crash; the only output was always a single line in dmesg. Examples:
[22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
[22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]
Or on another host:
[22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
[22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
[22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
[23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
[23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
[23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
[ 639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
[ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
[ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
[ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]
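For anyone decoding these traces: the faulting offset inside the
library is the ip minus the mapping base printed by dmesg, e.g. for
the first line 0x7f3b570ad7a5 - 0x7f3b57094000 = 0x197a5, which can
be resolved to a symbol once the matching dbgsym package is installed:

  # Library path as installed on bionic; adjust if different.
  addr2line -f -e /usr/lib/x86_64-linux-gnu/librte_eal.so.17.11 0x197a5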
This was somewhat "stabilized" by a full restart of OVS and the
neutron agents and by not touching any VMs, but on one machine we
still saw librte_vhost.so segfaults. Even without segfaults, we still
faced the "net eth0: unexpected txq (0) queue failure: -5" issue and
did not have working connectivity.
The issue was also easy to trigger by attempting a live migration of a
VM that was using a vhu port, although OVS was also crashing randomly
on its own.
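A hypothetical reproducer along those lines (domain name and
destination host are placeholders):

  virsh migrate --live instance-000003e8 qemu+ssh://other-host/system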
Failed attempts to restore the dataplane included:
1. Restart of ovs and neutron agents.
2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
3. Reboot of the hosts.
4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.
Solution:
After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and
17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by
manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_*
debs (63 packages in total). Full list of rolled back packages: [1]
Please note that we re-installed the latest available OVS version
(both openvswitch-switch and openvswitch-switch-dpdk,
2.9.8-0ubuntu0.18.04.4) before rolling back dpdk.
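A minimal sketch of such a rollback, assuming the 0ubuntu0.1 debs are
still available from the archive or a local mirror (the package
patterns are an assumption; see [1] for the exact list):

  # Downgrade every installed dpdk/librte package in one transaction,
  # then hold them so a later dist-upgrade does not pull them back up.
  pkgs=$(dpkg-query -W -f '${Package}\n' 'librte*' 'dpdk*')
  apt-get install --allow-downgrades \
      $(echo "$pkgs" | sed 's/$/=17.11.10-0ubuntu0.1/')
  apt-mark hold $pkgs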
Actions taken after the downgrade:
1. Stopped all VMs.
2. Restarted OVS.
3. Restarted neutron agents.
4. Started all VMs.
The rollback of the 63 dpdk/librte_* packages plus the service
restarts were the only actions needed to restore connectivity on all
machines. The error messages disappeared from the VMs' console logs
(no more "net eth0: unexpected txq (0) queue failure: -5"), OVS
started reporting rising rx_* counters on the vhu ports, and the
segmentation faults from ovs-vswitchd and the pmd threads stopped as
well.
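For completeness, the kind of sanity checks used to verify this
(package patterns illustrative, as above):

  # All dpdk/librte packages back at the 0ubuntu0.1 version:
  dpkg -l 'librte*' 'dpdk*' | awk '/^ii/ {print $2, $3}'

  # No new librte segfaults accumulating:
  dmesg -T | grep -E 'segfault .* in librte' || echo 'no segfaults logged'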
[0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
[1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/2015533/+subscriptions