[SRU][B/aws-5.3][F/aws][PATCH v2 0/1] xen-netfront: fix potential deadlock in xennet_remove()

Andrea Righi andrea.righi at canonical.com
Thu Jul 23 13:18:26 UTC 2020

BugLink: https://bugs.launchpad.net/bugs/1888510


[Impact]

During our AWS testing we experienced deadlocks on hibernate
across all Xen instance types. The trace showed that the system was
stuck in xennet_remove():

[ 358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 358.115102] modprobe D 0 4892 4833 0x00004004
[ 358.115104] Call Trace:
[ 358.115112] __schedule+0x2a8/0x670
[ 358.115115] schedule+0x33/0xa0
[ 358.115118] xennet_remove+0x1f0/0x230 [xen_netfront]
[ 358.115121] ? wait_woken+0x80/0x80
[ 358.115124] xenbus_dev_remove+0x51/0xa0
[ 358.115126] device_release_driver_internal+0xe0/0x1b0
[ 358.115127] driver_detach+0x49/0x90
[ 358.115129] bus_remove_driver+0x59/0xd0
[ 358.115131] driver_unregister+0x2c/0x40
[ 358.115132] xenbus_unregister_driver+0x12/0x20
[ 358.115134] netif_exit+0x10/0x7aa [xen_netfront]
[ 358.115137] __x64_sys_delete_module+0x146/0x290
[ 358.115140] do_syscall_64+0x5a/0x130
[ 358.115142] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This prevented hibernation from completing.

The cause of this problem is a race condition in xennet_remove(): the
function reads the current state of the bus, requests a change of the
state to "Closing", and then waits for the state to become "Closing".
However, if the state becomes "Closed" between reading the state and
requesting the state change, we are stuck forever, because the state
will never change from "Closed" back to "Closing".
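
The race can be illustrated with a small userspace model. This is only a
sketch, not the driver code: the enum, the function name and the idea of
passing the observed states as an array are all assumptions made for
illustration. It models a waiter that is only released when it observes
"Closing":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model states; they echo the names used in the text and
 * are not the kernel's own definitions. */
enum model_state { MODEL_CONNECTED, MODEL_CLOSING, MODEL_CLOSED };

/*
 * Model of the original wait: walk the states the waiter would observe,
 * in order, and report whether it ever sees "Closing".
 * A return value of false means the waiter is still blocked after the
 * last observation.
 */
static bool original_wait_completes(const enum model_state *observed, size_t n)
{
	assert(observed != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (observed[i] == MODEL_CLOSING)
			return true;
	}
	return false;
}
```

With the observations { MODEL_CONNECTED, MODEL_CLOSING } the model wait
completes. With { MODEL_CLOSED, MODEL_CLOSED, MODEL_CLOSED }, which
corresponds to the state having already become "Closed" before the wait
started, it never does.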

[Test case]

Create any Xen-based instance in AWS, hibernate/resume multiple times.
Sometimes the system gets stuck (hung task timeout) and hibernation
fails to complete.


[Fix]

Prevent the deadlock by changing the wait condition to also check for
state == Closed.
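
As a rough userspace sketch of the changed condition (not the actual
patch; the enum and function names below are hypothetical and exist only
for this illustration), the wait is released by either "Closing" or
"Closed":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model states; not the kernel's own definitions. */
enum model_state { MODEL_CONNECTED, MODEL_CLOSING, MODEL_CLOSED };

/* Widened wait condition: "Closed" now also releases the waiter. */
static bool fixed_wait_condition(enum model_state s)
{
	return s == MODEL_CLOSING || s == MODEL_CLOSED;
}

/*
 * Walk the states the waiter would observe, in order, and report
 * whether the widened condition is ever satisfied.
 */
static bool fixed_wait_completes(const enum model_state *observed, size_t n)
{
	assert(observed != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (fixed_wait_condition(observed[i]))
			return true;
	}
	return false;
}
```

In this model a waiter that only ever observes MODEL_CLOSED is released
on the first observation, while one that only observes MODEL_CONNECTED
keeps waiting, as intended.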

This is also an upstream bug. I posted a patch to the LKML and I'm
waiting for review and feedback.

This patch is not applied upstream, but both our tests and the tests
performed by Amazon show positive results after applying this fix (the
deadlock doesn't seem to happen anymore).

[Regression potential]

Minimal: this change affects only Xen, and more precisely only the
xen-netfront driver.

Changes in v2:
  - add missing BugLink
  - target the right kernels (B/aws-5.3 and F/aws)

