APPLIED: [PULL REQUESTS][focal/jammy/lunar linux-azure] [Azure] Fix VM crash/hang issues due to fast VF add/remove events

Tim Gardner tim.gardner at canonical.com
Wed Jul 5 19:14:00 UTC 2023


On 6/12/23 3:13 PM, Tim Gardner wrote:
> BugLink: https://bugs.launchpad.net/bugs/2023594
> 
> SRU Justification
> 
> [Impact]
> 
> A Linux guest on Hyper-V/Azure can occasionally crash during early Linux 
> kernel boot due to a strange host behavior:
> 1. The host assigns a VF to the guest;
> 2. The host immediately unassigns the VF from the guest (Dexuan: due
> to race condition bugs in the Linux vPCI driver, Linux can crash here);
> 3. The host assigns the VF to the guest again.
> 
> Starting in late 2022 (around November 2022), Linux guests on Azure began
> to crash more frequently because of a host-side update made at that time:
> a new host/hypervisor feature for handling "correctable memory errors"
> can cause many successive VF remove/add events, so the race condition
> bugs in the Linux vPCI driver surface much more easily. The Hyper-V team
> is implementing a batching mechanism so that the guest receives far
> fewer VF remove/add events (ETA: June 2023). In the meantime, the Linux
> race condition bugs should also be fixed so that Linux guests do not
> crash even if they receive successive VF remove/add events.
> 
> [Test Plan]
> 
> Tested by Microsoft.
> 
> [Regression potential]
> 
> PCI devices may not get registered, or VMs may crash.
> 
> [Other Info]
> 
> SF: #00349076
> 
> -------------------------------------------------------------------------------
> The following changes since commit d250cc0ce73d5582e5eb073fa948567ec2ef67d5:
> 
>    UBUNTU: Ubuntu-azure-5.4.0-1110.116 (2023-06-02 12:51:11 -0600)
> 
> are available in the Git repository at:
> 
>    git://git.launchpad.net/~timg-tpi/ubuntu/+source/linux/+git/focal focal-azure-fix-vm-add-remove-race-condition
> 
> for you to fetch changes up to ac03c5aa4ead40832fcd94d814d28ea8087fb906:
> 
>    UBUNTU: SAUCE: PCI: hv: Add a per-bus mutex state_lock (2023-06-12 14:57:33 -0600)
> 
> ----------------------------------------------------------------
> Dexuan Cui (4):
>        UBUNTU: SAUCE: PCI: hv: Fix a race condition bug in hv_pci_query_relations()
>        UBUNTU: SAUCE: PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
>        UBUNTU: SAUCE: PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
>        UBUNTU: SAUCE: PCI: hv: Add a per-bus mutex state_lock
> 
>   drivers/pci/controller/pci-hyperv.c | 58 ++++++++++++++++++++++++++++++++++++++--------------------
>   1 file changed, 38 insertions(+), 20 deletions(-)
> -------------------------------------------------------------------------------
> 
> The following changes since commit 98439f7092e414a0534c5687742e2c8309a30204:
> 
>    UBUNTU: Ubuntu-azure-5.15.0-1040.47 (2023-06-01 13:13:05 -0600)
> 
> are available in the Git repository at:
> 
>    git://git.launchpad.net/~timg-tpi/ubuntu/+source/linux/+git/jammy jammy-azure-fix-vm-add-remove-race-condition
> 
> for you to fetch changes up to c7caafcc393e6435c8b6b24ecde574f30be95c35:
> 
>    PCI: hv: Use async probing to reduce boot time (2023-06-12 14:58:58 -0600)
> 
> ----------------------------------------------------------------
> Dexuan Cui (6):
>        PCI: hv: Fix a race condition bug in hv_pci_query_relations()
>        PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
>        PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
>        Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
>        PCI: hv: Add a per-bus mutex state_lock
>        PCI: hv: Use async probing to reduce boot time
> 
>   drivers/pci/controller/pci-hyperv.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
>   1 file changed, 86 insertions(+), 59 deletions(-)
> -------------------------------------------------------------------------------
> 
> The following changes since commit e4972fe7acaf327c1bd496cf2286889bd913c9bf:
> 
>    UBUNTU: Ubuntu-azure-6.2.0-1005.5 (2023-06-01 11:42:19 -0600)
> 
> are available in the Git repository at:
> 
>    git://git.launchpad.net/~timg-tpi/ubuntu/+source/linux-azure/+git/lunar lunar-azure-fix-vm-add-remove-race-condition
> 
> for you to fetch changes up to 7e734194962a51084cba9016f0e5a512805f7d0a:
> 
>    PCI: hv: Use async probing to reduce boot time (2023-06-12 15:01:00 -0600)
> 
> ----------------------------------------------------------------
> Dexuan Cui (6):
>        PCI: hv: Fix a race condition bug in hv_pci_query_relations()
>        PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
>        PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
>        Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
>        PCI: hv: Add a per-bus mutex state_lock
>        PCI: hv: Use async probing to reduce boot time
> 
>   drivers/pci/controller/pci-hyperv.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
>   1 file changed, 86 insertions(+), 59 deletions(-)
Applied to f/j/l linux-azure:master-next. Thanks.

-rtg
-- 
-----------
Tim Gardner
Canonical, Inc




More information about the kernel-team mailing list