[Bug 2015533] Re: Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic

Fri May 12 22:19:12 UTC 2023

I could reproduce the traffic interruption in a test environment 
with the commands executed by the packaging scripts:

$ sudo update-alternatives --remove ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
$ sudo update-alternatives --install /usr/sbin/ovs-vswitchd ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk 50
$ sudo invoke-rc.d openvswitch-switch restart

The VMs stopped pinging each other.
(There are no errors in the VMs' console, but this might
differ because the virtio-net driver in my VMs differs from theirs.)

The fix-up probably happened when Bootstack re-installed
the latest ovs dpdk package, or somewhere along the way, when the
update-alternatives link was set back to ovs-dpdk and
ovs was restarted.
[As per the bug description: "Please note that we also re-installed the latest
available OVS (both openvswitch-switch and openvswitch-switch-dpdk)
version before rolling back dpdk: 2.9.8-0ubuntu0.18.04.4."]
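
One way to check which binary the ovs-vswitchd alternative currently
points at is to read the "Value:" field of `update-alternatives --query`.
A minimal sketch follows; the helper name current_alternative is made up
for illustration, and it only filters text read from stdin:

```shell
# Read `update-alternatives --query <name>` output on stdin and print the
# path the alternative currently resolves to (the "Value:" field).
# current_alternative is a hypothetical helper name, not part of any package.
current_alternative() {
    sed -n 's/^Value: //p'
}

# Usage (requires the ovs-vswitchd alternative group to exist on the host):
#   update-alternatives --query ovs-vswitchd | current_alternative
```

If the printed path is not the ovs-vswitchd-dpdk binary after the
--remove/--install sequence, the restart picked up the non-DPDK daemon.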

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/2015533

Title:
  Loss of network connectivity after upgrading dpdk packages from
  17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic

Status in dpdk package in Ubuntu:
  Incomplete
Status in openvswitch package in Ubuntu:
  Invalid

Bug description:
  We upgraded the following packages on a number of hosts running on bionic-queens:
  * dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
  * openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4

  It was just a plain `apt dist-upgrade` which upgraded a number of
  other packages - I can provide a full list of upgraded packages if
  needed.

  This resulted in a complete dataplane outage on a production cloud.

  Symptoms:

  1. Loss of network connectivity on virtual machines using
  dpdkvhostuser ports.

  VMs were unable to send any packets. Using `virsh console` we observed
  the following line printed a few times per second:

  net eth0: unexpected txq (0) queue failure: -5

  At the same time we also observed the following messages in OVS logs:

  Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
  Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed

  rx_* counters on the vhu port in OVS (rx from ovs = tx from VM's point
  of view) were not increasing.

  2. Segmentation faults in ovs/dpdk libraries.

  This was another symptom. After restarting ovs it would run fine for a while but would crash after approx. 5-60 minutes on upgraded hosts.
  There were no logs from ovs itself that would show the crash; the only output was always a single line in dmesg. Examples:

  [22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
  [22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]

  Or on another host:
  [22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
  [22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
  [22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
  [23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
  [23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
  [23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
  [  639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
  [ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
  [ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
  [ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]
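
A note on reading these lines: subtracting the mapping base (the first
number in the square brackets after the library name) from the "ip" value
gives the offset of the faulting instruction inside the library, which is
what a symbol lookup against the matching debug symbols would need. A
minimal sketch, assuming the dmesg line format shown above; the function
name segfault_offset is made up for illustration:

```shell
# Print "<library> <offset>" for one kernel segfault line passed as $1.
# offset = instruction pointer - base address of the library mapping.
# segfault_offset is a hypothetical helper name used only for this sketch.
segfault_offset() {
    # Pull out "<ip> <library> <base>"; empty if the line does not match.
    fields=$(printf '%s\n' "$1" | sed -n \
        's/.* ip \([0-9a-f]*\) .* in \([^[]*\)\[\([0-9a-f]*\)+[0-9a-f]*].*/\1 \2 \3/p')
    [ -n "$fields" ] || return 1
    # Split the three fields into $1 (ip), $2 (library), $3 (base).
    set -- $fields
    printf '%s 0x%x\n' "$2" "$(( 0x$1 - 0x$3 ))"
}

# Usage:
#   dmesg | grep segfault | while IFS= read -r l; do segfault_offset "$l"; done
```

Applied to the lines above, the librte_eal.so.17.11 crashes with an ip
ending in 7a5 all resolve to offset 0x197a5, and the pmd6 crashes in
librte_vhost.so.17.11 all resolve to 0xe00e, so each group appears to
fault at the same instruction.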

  This was sort of "stabilized" by a full restart of OVS and neutron
  agents and not touching any VMs, but on one machine we still saw
  librte_vhost.so segfaults. But even without segfaults we still faced
  the issue with "net eth0: unexpected txq (0) queue failure: -5" and
  didn't have working connectivity.

  The issue was also easy to trigger by attempting a live migration of a
  VM that was using a vhu port although it was also crashing randomly on
  its own.

  Failed attempts to restore the dataplane included:
  1. Restart of ovs and neutron agents.
  2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
  3. Reboot of the hosts.
  4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.

  Solution:

  After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and
  17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by
  manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_*
  debs (63 packages in total). Full list of rolled back packages: [1]

  Please note that we also re-installed the latest available OVS (both
  openvswitch-switch and openvswitch-switch-dpdk) version before rolling
  back dpdk: 2.9.8-0ubuntu0.18.04.4.

  Actions taken after the downgrade:
  1. Stopped all VMs.
  2. Restarted OVS.
  3. Restarted neutron agents.
  4. Started all VMs.

  Rollback of 63 dpdk/librte_* packages and service restarts were the
  only actions that we needed to restore the connectivity on all
  machines. Error messages disappeared from VMs' console log (no more
  "net eth0: unexpected txq (0) queue failure: -5"). OVS started to
  report rx_* counters rising on vhu ports. Segmentation faults from ovs
  and pmd have stopped as well.

  [0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
  [1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/2015533/+subscriptions