ACK: [SRU][B/aws-5.3][F/aws][PATCH v2 0/1] xen-netfront: fix potential deadlock in xennet_remove()

Colin Ian King colin.king at canonical.com
Thu Jul 23 13:28:51 UTC 2020


On 23/07/2020 14:18, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1888510
> 
> [Impact]
> 
> During our AWS testing we were experiencing deadlocks on hibernate
> across all Xen instance types. The trace was showing that the system was
> stuck in xennet_remove():
> 
> [ 358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
> [ 358.115102] modprobe D 0 4892 4833 0x00004004
> [ 358.115104] Call Trace:
> [ 358.115112] __schedule+0x2a8/0x670
> [ 358.115115] schedule+0x33/0xa0
> [ 358.115118] xennet_remove+0x1f0/0x230 [xen_netfront]
> [ 358.115121] ? wait_woken+0x80/0x80
> [ 358.115124] xenbus_dev_remove+0x51/0xa0
> [ 358.115126] device_release_driver_internal+0xe0/0x1b0
> [ 358.115127] driver_detach+0x49/0x90
> [ 358.115129] bus_remove_driver+0x59/0xd0
> [ 358.115131] driver_unregister+0x2c/0x40
> [ 358.115132] xenbus_unregister_driver+0x12/0x20
> [ 358.115134] netif_exit+0x10/0x7aa [xen_netfront]
> [ 358.115137] __x64_sys_delete_module+0x146/0x290
> [ 358.115140] do_syscall_64+0x5a/0x130
> [ 358.115142] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> This prevented hibernation from completing.
> 
> The cause of this problem is a race condition in xennet_remove(): the
> function reads the current state of the bus, requests a change to
> "Closing", and then waits for the state to become "Closing". However,
> if the state becomes "Closed" between reading the state and requesting
> the state change, we are stuck forever, because the state will never
> change from "Closed" back to "Closing".
> 
> [Test case]
> 
> Create any Xen-based instance in AWS, hibernate/resume multiple times.
> Sometimes the system gets stuck (hung task timeout) and hibernation
> fails.
> 
> [Fix]
> 
> Prevent the deadlock by changing the wait condition to also check for
> state == Closed.
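
For readers following the race described above, here is a minimal
userspace sketch of the two wait conditions. It is not the patch itself:
the enum and the function names are hypothetical stand-ins, not the
kernel's enum xenbus_state or any xen-netfront symbol, and the "before"
condition is only inferred from the description in [Impact].

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the backend states named in the report;
 * these are NOT the kernel's enum xenbus_state values. */
enum backend_state { STATE_CONNECTED, STATE_CLOSING, STATE_CLOSED };

/* Wait condition as described in [Impact]: only "Closing" ends the wait. */
static bool wait_done_before(enum backend_state s)
{
	return s == STATE_CLOSING;
}

/* Wait condition as described in [Fix]: "Closed" also ends the wait. */
static bool wait_done_after(enum backend_state s)
{
	return s == STATE_CLOSING || s == STATE_CLOSED;
}

/* Racy case: the backend is already "Closed" when the wait starts. */
static void check_racy_case(void)
{
	enum backend_state raced = STATE_CLOSED;

	/* The old predicate never becomes true, so the waiter would sleep
	 * forever ("Closed" never goes back to "Closing"). */
	assert(!wait_done_before(raced));

	/* The widened predicate is already satisfied, so the wait returns. */
	assert(wait_done_after(raced));
}
```

Calling check_racy_case() from a small main() exercises the case from
the trace: with the old condition the waiter has nothing left to wake it,
while the widened condition lets the removal path proceed.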
> 
> This is also an upstream bug. I posted a patch to the LKML and I'm
> waiting for review / feedback:
> https://lore.kernel.org/lkml/20200722065211.GA841369@xps-13/T/#u
> 
> This patch is not yet applied upstream, but both our tests and the
> tests performed by Amazon show positive results after applying this fix
> (the deadlock doesn't seem to happen anymore).
> 
> [Regression potential]
> 
> Minimal: this change affects only Xen, and more precisely only the
> xen-netfront driver.
> 
> Changes in v2:
>   - add missing BugLink
>   - target the right kernels (B/aws-5.3 and F/aws)
> 
> 

I acked this before based on the code working fine and producing
positive test results, so..

Acked-by: Colin Ian King <colin.king at canonical.com>


