NAK: [SRU][E/F][PATCH 0/1] xen-netfront: fix potential deadlock in xennet_remove()
andrea.righi at canonical.com
Thu Jul 23 13:06:17 UTC 2020
On Thu, Jul 23, 2020 at 09:15:17AM +0200, Stefan Bader wrote:
> On 22.07.20 15:47, Andrea Righi wrote:
> > [Impact]
> > During our AWS testing we were experiencing deadlocks on hibernate
> > across all Xen instance types. The trace was showing that the system was
> > stuck in xennet_remove():
> > [ 358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
> > [ 358.115102] modprobe D 0 4892 4833 0x00004004
> > [ 358.115104] Call Trace:
> > [ 358.115112] __schedule+0x2a8/0x670
> > [ 358.115115] schedule+0x33/0xa0
> > [ 358.115118] xennet_remove+0x1f0/0x230 [xen_netfront]
> > [ 358.115121] ? wait_woken+0x80/0x80
> > [ 358.115124] xenbus_dev_remove+0x51/0xa0
> > [ 358.115126] device_release_driver_internal+0xe0/0x1b0
> > [ 358.115127] driver_detach+0x49/0x90
> > [ 358.115129] bus_remove_driver+0x59/0xd0
> > [ 358.115131] driver_unregister+0x2c/0x40
> > [ 358.115132] xenbus_unregister_driver+0x12/0x20
> > [ 358.115134] netif_exit+0x10/0x7aa [xen_netfront]
> > [ 358.115137] __x64_sys_delete_module+0x146/0x290
> > [ 358.115140] do_syscall_64+0x5a/0x130
> > [ 358.115142] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > This prevented hibernation from completing.
> > The cause of this problem is a race condition in xennet_remove(): the
> > system is reading the current state of the bus, it's requesting to
> > change the state to "Closing", and it's waiting for the state to be
> > changed to "Closing". However, if the state becomes "Closed" between
> > reading the state and requesting the state change, we are stuck forever,
> > because the state will never change from "Closed" back to "Closing".
> > [Test case]
> > Create any Xen-based instance in AWS, hibernate/resume multiple times.
> > Sometimes the system gets stuck (hung task timeout) and hibernation
> > fails.
> > [Fix]
> > Prevent the deadlock by changing the wait condition to also check for
> > state == Closed.
> > This is also an upstream bug, I posted a patch to the LKML and I'm
> > waiting for review / feedback:
> > https://lore.kernel.org/lkml/20200722065211.GA841369@xps-13/T/#u
> > This patch has not been applied upstream yet, but both our tests and the tests
> > performed by Amazon show positive results after applying this fix (the
> > deadlock doesn't seem to happen anymore).
> > [Regression potential]
> > Minimal, this change affects only Xen and more precisely only the
> > xen-netfront driver.
> Besides the BugLink, which we can add later, there is a different target source in
> the bug report vs. the submission. The bug report is against linux-aws, the
> submission would be against the main kernel.
> Also note that Eoan is EOL, so anything submitted for 5.3 needs to be
> handled with extra care (this is probably less for Andrea than for whoever is
> going to apply).
NACK-ing this patch. I'll send a new version targeting the proper
kernels and adding the BugLink.
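
For reference, the race described in [Impact] and the changed wait
condition described in [Fix] can be modeled in a few lines of userspace C.
This is only a sketch under stated assumptions: the enum, the function
names and the "state seen, then state now" model are hypothetical and made
up for illustration. It is not the xen-netfront code, and the real patch is
the one linked above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the bus states named in the report. */
enum bus_state { STATE_CONNECTED, STATE_CLOSING, STATE_CLOSED };

/* Wait condition as described in [Impact]: only "Closing" ends the wait. */
static bool wait_done_before_fix(enum bus_state backend)
{
	return backend == STATE_CLOSING;
}

/* Wait condition as described in [Fix]: "Closed" also ends the wait. */
static bool wait_done_after_fix(enum bus_state backend)
{
	return backend == STATE_CLOSING || backend == STATE_CLOSED;
}

/*
 * Model of the remove path: the other end is first observed in state
 * `seen`, then it may move to state `now` before the wait starts.
 * Since "Closed" never goes back to "Closing", `now` is treated as the
 * final state, so a false return value stands for "waits forever".
 */
static bool remove_would_finish(enum bus_state seen, enum bus_state now,
				bool fixed)
{
	if (seen == STATE_CLOSED)
		return true;	/* already closed, nothing to wait for */

	return fixed ? wait_done_after_fix(now) : wait_done_before_fix(now);
}

/* Small self-check of the three interesting cases. */
static void check_model(void)
{
	/* No race: the other end reaches "Closing" as expected. */
	assert(remove_would_finish(STATE_CONNECTED, STATE_CLOSING, false));
	/* Race: the other end is already "Closed" when the wait starts. */
	assert(!remove_would_finish(STATE_CONNECTED, STATE_CLOSED, false));
	/* Same race with the extended wait condition. */
	assert(remove_would_finish(STATE_CONNECTED, STATE_CLOSED, true));
}
```

The second case in check_model() is the hang from the trace: the state
was not "Closed" when it was read, but it is by the time the wait begins.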