[Bug 2015533] Re: Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Mauricio Faria de Oliveira
2015533 at bugs.launchpad.net
Fri May 12 22:22:49 UTC 2023
Now, I still haven't looked at the segfaults, but I initially
couldn't reproduce them with just the dpdk pkg version change.
(And they didn't always happen / correlate with the traffic
interruption, so it does seem to be another problem.)
So, "just" library version incompatibility/interface breakage
on different ends such as QEMU/OVS/DPDK/RTE doesn't seem to be
the issue (which is likely reassured as there are no related
bugs/fixes to the DPDK packages in Bionic for a long time now;
so it seems to be a corner case).
Thus it _might_ be some weird state left in the VMs' virtio-net
driver, that eventually talked back to the DPDK vhost ports and
provided them with wrong pointers... or some unexpected state
introduced by restarts of different components.
This needs more assessment and information to determine next steps.
For starters, it'd be nice to know what kernel version is
running in the VMs, to check the virtio-net driver level and
the features negotiated with the hypervisor's vhost side.
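For example, something along these lines, run inside one of the
affected VMs, would collect that; a minimal sketch, assuming the
interface is named eth0 (adjust to the guest's virtio-net
interface):

#!/usr/bin/env python3
# Sketch: print the guest kernel version and the virtio feature bits
# negotiated for the virtio-net device behind a given interface.
import os

iface = "eth0"  # assumed guest interface name
print("guest kernel:", os.uname().release)

dev = os.path.realpath(f"/sys/class/net/{iface}/device")
drv = os.path.basename(os.path.realpath(os.path.join(dev, "driver")))
print(f"{iface} driver: {drv}")

feat = os.path.join(dev, "features")
if os.path.exists(feat):
    bits = open(feat).read().strip()  # string of '0'/'1', bit 0 first
    print("negotiated feature bits:",
          [i for i, b in enumerate(bits) if b == "1"])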
Marking dpdk as Incomplete.
** Changed in: openvswitch (Ubuntu)
Status: Confirmed => Invalid
** Changed in: dpdk (Ubuntu)
Status: Confirmed => Incomplete
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/2015533
Title:
Loss of network connectivity after upgrading dpdk packages from
17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Status in dpdk package in Ubuntu:
Incomplete
Status in openvswitch package in Ubuntu:
Invalid
Bug description:
We upgraded the following packages on a number of hosts running on bionic-queens:
* dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
* openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4
It was just a plain `apt dist-upgrade` which upgraded a number of
other packages - I can provide a full list of upgraded packages if
needed.
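For reference, a rough sketch of how that list could be pulled out
of dpkg's log; the date below is only a placeholder and the default
/var/log/dpkg.log path is assumed:

#!/usr/bin/env python3
# Sketch: list the packages upgraded by that apt run, from dpkg's log.
DATE = "2023-04-06"  # placeholder date of the dist-upgrade

with open("/var/log/dpkg.log") as f:
    for line in f:
        # format: "<date> <time> upgrade <pkg:arch> <old-ver> <new-ver>"
        if line.startswith(DATE) and " upgrade " in line:
            print(line.rstrip())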
This resulted in a complete dataplane outage on a production cloud.
Symptoms:
1. Loss of network connectivity on virtual machines using
dpdkvhostuser ports.
VMs were unable to send any packets. Using `virsh console` we observed
the following line printed a few times per second:
net eth0: unexpected txq (0) queue failure: -5
At the same time we also observed the following messages in OVS logs:
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed
The rx_* counters on the vhu port in OVS (rx from OVS's side =
tx from the VM's point of view) were not increasing.
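A minimal sketch of the kind of check we used; the port name below
is a placeholder (take the real one from `ovs-vsctl show`), and
ovs-vsctl is assumed to be on PATH:

#!/usr/bin/env python3
# Sketch: sample rx_packets on a vhostuser port twice and report whether
# the counter is moving.
import subprocess
import time

PORT = "vhu1234abcd-ef"  # hypothetical vhu port name

def rx_packets(port):
    out = subprocess.check_output(
        ["ovs-vsctl", "get", "Interface", port, "statistics:rx_packets"])
    return int(out.strip())

before = rx_packets(PORT)
time.sleep(10)
after = rx_packets(PORT)
print(f"{PORT} rx_packets: {before} -> {after}",
      "(stalled)" if after == before else "(moving)")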
2. Segmentation faults in ovs/dpdk libraries.
This was another symptom. After restarting OVS it would run fine for a while but would crash after approx. 5-60 minutes on upgraded hosts.
There were no logs from OVS itself that would show the crash; the only output was always a single line in dmesg. Examples:
[22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
[22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]
Or on another host:
[22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
[22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
[22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
[23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
[23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
[23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
[ 639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
[ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
[ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
[ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]
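For anyone digging further, a rough sketch of how such a line can be
mapped to a symbol; the library install path is an assumption and the
matching dbgsym packages would be needed:

#!/usr/bin/env python3
# Sketch: turn one of the dmesg lines above into an offset that
# addr2line can resolve, given debug symbols for the library.
import re
import subprocess

line = ("ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 "
        "sp 00007f3b41b59660 error 6 in "
        "librte_eal.so.17.11[7f3b57094000+26000]")

m = re.search(r"ip ([0-9a-f]+) .* in (\S+)\[([0-9a-f]+)\+[0-9a-f]+\]", line)
ip, lib, base = m.group(1), m.group(2), m.group(3)
offset = int(ip, 16) - int(base, 16)
print(f"{lib}: fault at offset {offset:#x} into the mapping")

# With the matching dbgsym package installed, something like this should
# map the offset to a source line (library path is an assumption):
lib_path = "/usr/lib/x86_64-linux-gnu/" + lib
subprocess.run(["addr2line", "-f", "-e", lib_path, f"{offset:#x}"])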
This was sort of "stabilized" by a full restart of OVS and the
neutron agents while not touching any VMs, but on one machine we
still saw librte_vhost.so segfaults. Even without segfaults we
still faced the "net eth0: unexpected txq (0) queue failure: -5"
issue and didn't have working connectivity.
The issue was also easy to trigger by attempting a live migration of a
VM that was using a vhu port, although it was also crashing randomly on
its own.
Failed attempts to restore the dataplane included:
1. Restart of ovs and neutron agents.
2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
3. Reboot of the hosts.
4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.
Solution:
After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and
17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by
manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_*
debs (63 packages in total). Full list of rolled back packages: [1]
Please note that we also re-installed the latest available OVS
version (2.9.8-0ubuntu0.18.04.4, both openvswitch-switch and
openvswitch-switch-dpdk) before rolling back dpdk.
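For the record, a minimal sketch of the kind of batch downgrade we
performed; the "rollback.txt" file name is just illustrative and
holds the 63 package names from [1]:

#!/usr/bin/env python3
# Sketch: batch-downgrade the dpdk/librte_* packages to 0ubuntu0.1.
# Assumes "rollback.txt" has one package name per line and that this
# runs as root on the affected host.
import subprocess

OLD_VERSION = "17.11.10-0ubuntu0.1"

with open("rollback.txt") as f:
    pkgs = [l.strip() for l in f if l.strip()]

subprocess.check_call(
    ["apt-get", "install", "-y", "--allow-downgrades"]
    + [f"{p}={OLD_VERSION}" for p in pkgs])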
Actions taken after the downgrade:
1. Stopped all VMs.
2. Restarted OVS.
3. Restarted neutron agents.
4. Started all VMs.
The rollback of the 63 dpdk/librte_* packages and the service
restarts were the only actions we needed to restore connectivity
on all machines. The error messages disappeared from the VMs'
console logs (no more "net eth0: unexpected txq (0) queue
failure: -5"), OVS started to report rising rx_* counters on the
vhu ports, and the segmentation faults from ovs and the pmd
threads stopped as well.
[0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
[1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/2015533/+subscriptions