[SRU][E/F][PATCH 0/1] xen-netfront: fix potential deadlock in xennet_remove()

Stefan Bader stefan.bader at canonical.com
Thu Jul 23 07:15:17 UTC 2020


On 22.07.20 15:47, Andrea Righi wrote:
> [Impact]
> 
> During our AWS testing we were experiencing deadlocks on hibernate
> across all Xen instance types. The trace was showing that the system was
> stuck in xennet_remove():
> 
> [ 358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
> [ 358.115102] modprobe D 0 4892 4833 0x00004004
> [ 358.115104] Call Trace:
> [ 358.115112] __schedule+0x2a8/0x670
> [ 358.115115] schedule+0x33/0xa0
> [ 358.115118] xennet_remove+0x1f0/0x230 [xen_netfront]
> [ 358.115121] ? wait_woken+0x80/0x80
> [ 358.115124] xenbus_dev_remove+0x51/0xa0
> [ 358.115126] device_release_driver_internal+0xe0/0x1b0
> [ 358.115127] driver_detach+0x49/0x90
> [ 358.115129] bus_remove_driver+0x59/0xd0
> [ 358.115131] driver_unregister+0x2c/0x40
> [ 358.115132] xenbus_unregister_driver+0x12/0x20
> [ 358.115134] netif_exit+0x10/0x7aa [xen_netfront]
> [ 358.115137] __x64_sys_delete_module+0x146/0x290
> [ 358.115140] do_syscall_64+0x5a/0x130
> [ 358.115142] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> This prevented hibernation from completing.
> 
> The cause of this problem is a race condition in xennet_remove(): the
> driver reads the current state of the bus, requests a change to
> "Closing", and then waits for the state to become "Closing". However,
> if the state becomes "Closed" between reading the state and requesting
> the state change, we are stuck forever, because the state will never
> change from "Closed" back to "Closing".
> 
> [Test case]
> 
> Create any Xen-based instance in AWS and hibernate/resume it multiple
> times. Sometimes the system gets stuck (hung task timeout) and
> hibernation fails.
> 
> [Fix]
> 
> Prevent the deadlock by changing the wait condition to also check for
> state == Closed.
> 
> This is also an upstream bug. I posted a patch to LKML and I'm
> waiting for review / feedback:
> https://lore.kernel.org/lkml/20200722065211.GA841369@xps-13/T/#u
> 
> This patch has not been applied upstream, but both our tests and the
> tests performed by Amazon show positive results after applying this fix
> (the deadlock doesn't seem to happen anymore).
> 
> [Regression potential]
> 
> Minimal: this change affects only Xen and, more precisely, only the
> xen-netfront driver.
> 
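
For reference, the race described under [Impact] and the wait-condition
change described under [Fix] can be modeled in user space. This is only a
minimal sketch, not the driver code: the enum, the two predicates and
wait_ends_at() are hypothetical names introduced for illustration, and a
recorded list of states stands in for what the backend reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the states named in the report; these are not
 * the kernel's definitions. */
enum backend_state {
    STATE_CONNECTED,
    STATE_CLOSING,
    STATE_CLOSED,
};

/* Wait condition as described under [Impact]: only "Closing" ends the wait. */
static bool wait_done_before(enum backend_state s)
{
    return s == STATE_CLOSING;
}

/* Wait condition as described under [Fix]: "Closed" also ends the wait. */
static bool wait_done_after(enum backend_state s)
{
    return s == STATE_CLOSING || s == STATE_CLOSED;
}

/*
 * Walk the states the waiter would observe, in order. The last entry is
 * assumed to persist forever (the state never goes from "Closed" back to
 * "Closing"). Returns the index at which the wait ends, or -1 if the
 * waiter would stay blocked.
 */
static int wait_ends_at(const enum backend_state *trace, int len,
                        bool (*wait_done)(enum backend_state))
{
    assert(len > 0);
    for (int i = 0; i < len; i++) {
        if (wait_done(trace[i]))
            return i;
    }
    return -1;
}
```

With a trace that contains only STATE_CLOSED (the state already changed
before the wait started), wait_ends_at() returns -1 for wait_done_before
and 0 for wait_done_after, which is the stuck versus not-stuck difference
the cover letter describes.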
> 
Besides the BugLink, which we can add later, the bug report and the submission
target different sources: the bug report is against linux-aws, while the
submission would be against the main kernel.

Also note that Eoan is EOL, so anything submitted for 5.3 needs to be
handled with extra care (this is probably less relevant for Andrea than for
whoever is going to apply the patch).

-Stefan
