APPLIED: [PATCH 0/2][focal/linux-azure] Azure: Mellanox VF NIC crashes when removed
Tim Gardner
tim.gardner at canonical.com
Fri May 20 14:57:06 UTC 2022
Applied to focal/linux-azure:master-next. Thanks.
-rtg
On 5/17/22 08:22, Tim Gardner wrote:
> BugLink: https://bugs.launchpad.net/bugs/1973758
>
> SRU Justification
>
> [Impact]
>
> Kernels 5.4.0-1075-azure and newer can easily panic when the Mellanox VF NIC
> is removed and re-added, either by an Azure host servicing event or by the
> manual unbind/bind test below (the GUID differs between VMs):
>
> for i in `seq 1 1000`;
> do
> cd /sys/bus/vmbus/drivers/hv_pci;
> echo abdc2107-402e-4704-8c88-c2b850696c3c > unbind;
> echo abdc2107-402e-4704-8c88-c2b850696c3c > bind;
> done
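
For anyone re-running the reproducer quoted above, here is a minimal sketch of the same loop wrapped in a function. The names rebind_loop and DRIVER_DIR are hypothetical and not part of the original report. The default path is the one from the report, and the GUID must be replaced with the one from your own VM. Overriding DRIVER_DIR lets the loop be dry-run against a scratch directory before touching sysfs.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): repeatedly unbind and
# rebind one VMBus device from the hv_pci driver.
# DRIVER_DIR defaults to the sysfs path used in the report; override it to
# dry-run against a scratch directory.
rebind_loop() {
    guid=$1
    count=${2:-1000}
    driver_dir=${DRIVER_DIR:-/sys/bus/vmbus/drivers/hv_pci}

    # A GUID is mandatory; it differs from VM to VM.
    [ -n "$guid" ] || { echo "usage: rebind_loop GUID [COUNT]" >&2; return 2; }

    # Both control files must be writable (normally requires root).
    [ -w "$driver_dir/unbind" ] && [ -w "$driver_dir/bind" ] || {
        echo "cannot write unbind/bind under $driver_dir" >&2
        return 1
    }

    i=1
    while [ "$i" -le "$count" ]; do
        # Stop at the first failed write instead of continuing blindly.
        echo "$guid" > "$driver_dir/unbind" || return 1
        echo "$guid" > "$driver_dir/bind" || return 1
        i=$((i + 1))
    done
    echo "completed $count unbind/bind cycles for $guid"
}
```

Example use on an affected VM, as root: rebind_loop abdc2107-402e-4704-8c88-c2b850696c3c 1000. On an unfixed kernel the VM is expected to panic before the loop completes, so run it only on a disposable test instance.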
>
> A sample panic call-trace is:
> [ 107.359954] kernel BUG at /build/linux-azure-5.4-4I3kFs/linux-azure-5.4-5.4.0/mm/slub.c:4020!
> [ 107.363858] invalid opcode: 0000 [#1] SMP NOPTI
> [ 107.365870] CPU: 0 PID: 334 Comm: kworker/0:2 Not tainted 5.4.0-1077-azure #80~18.04.1-Ubuntu
> [ 107.369589] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
> [ 107.373811] Workqueue: events vmbus_onmessage_work
> [ 107.375909] RIP: 0010:kfree+0x1d2/0x240
> …
> [ 107.413789] Call Trace:
> [ 107.414867] kobject_uevent_env+0x1b5/0x7e0
> [ 107.416747] kobject_uevent+0xb/0x10
> [ 107.418327] device_release_driver_internal+0x191/0x1c0
> [ 107.420653] device_release_driver+0x12/0x20
> [ 107.422523] bus_remove_device+0xe1/0x150
> [ 107.424279] device_del+0x167/0x380
> [ 107.425824] device_unregister+0x1a/0x60
> [ 107.427536] vmbus_device_unregister+0x27/0x50
> [ 107.429528] vmbus_onoffer_rescind+0x1d0/0x1f0
> [ 107.431474] vmbus_onmessage+0x2c/0x70
> [ 107.433104] vmbus_onmessage_work+0x22/0x30
> [ 107.434919] process_one_work+0x209/0x400
> [ 107.436661] worker_thread+0x34/0x40
>
> It turns out there is a bug in
> https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/commit/?id=16a3c750a78d8,
> which misses the second hunk of the upstream patch
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=877b911a5ba0.
>
> Please apply the below patch to fix the issue:
>
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3653,7 +3653,7 @@ static int hv_pci_remove(struct hv_device *hdev)
>
>  	hv_put_dom_num(hbus->bridge->domain_nr);
>
> -	free_page((unsigned long)hbus);
> +	kfree(hbus);
>  	return ret;
>  }
>
> Please also apply the patch below. It is not strictly required, since it only
> affects an error-handling path that is rarely taken:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42c3d41832ef4fcf60aaa6f748de01ad99572adf
>
> [Test Case]
>
> Microsoft tested
>
> [Other Info]
>
> SF: #00336939
>
--
-----------
Tim Gardner
Canonical, Inc