ACK/cmnt: [Trusty][PATCH] UBUNTU: SAUCE: xen: do not re-use pirq number cached in pci device msi msg data

Dan Streetman dan.streetman at canonical.com
Wed Jan 18 18:19:14 UTC 2017


On Wed, Jan 18, 2017 at 11:42 AM, Stefan Bader
<stefan.bader at canonical.com> wrote:
> On 18.01.2017 17:08, Dan Streetman wrote:
>> On Mon, Jan 16, 2017 at 6:34 AM, Stefan Bader
>> <stefan.bader at canonical.com> wrote:
>>> On 13.01.2017 22:13, Dan Streetman wrote:
>>>> BugLink: http://bugs.launchpad.net/bugs/1656381
>>>>
>>>> Revert the main part of commit:
>>>> af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>>
>>>> That commit introduced reading the pci device's msi message data to see
>>>> if a pirq was previously configured for the device's msi/msix, and re-using
>>>> that pirq.  At the time, that was the correct behavior.  However, a
>>>> later change to Qemu caused it to call into the Xen hypervisor to unmap
>>>> all pirqs for a pci device, when the pci device disables its MSI/MSIX
>>>> vectors; specifically the Qemu commit:
>>>> c976437c7dba9c7444fb41df45468968aaa326ad
>>>> ("qemu-xen: free all the pirqs for msi/msix when driver unload")
>>>>
>>>> Once Qemu added this pirq unmapping, it was no longer correct for the
>>>> kernel to re-use the pirq number cached in the pci device msi message
>>>> data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
>>>> pirqs when the pci device disables its MSI/MSIX vectors.
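
(Side note, for anyone following along: here's a rough sketch of what that
qemu change does, going by the commit's description.  The wrapper function
and pirq bookkeeping are illustrative only, though xc_physdev_unmap_pirq()
is the real libxenctrl call:)

/* Sketch of qemu >= 2.1.0 behavior: when the guest disables MSI/MSI-X
 * for a passed-through device, qemu asks the hypervisor to unmap every
 * pirq bound to that device.  Any pirq number the guest cached in its
 * MSI message data is stale after this point.
 */
#include <xenctrl.h>

static void example_unmap_all_pirqs(xc_interface *xch, uint32_t domid,
                                    int *pirqs, int nr_pirqs)
{
        int i;

        for (i = 0; i < nr_pirqs; i++) {
                if (pirqs[i] >= 0) {
                        xc_physdev_unmap_pirq(xch, domid, pirqs[i]);
                        pirqs[i] = -1;
                }
        }
}
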
>>>
>>> In the Trusty release we have qemu-2.0.0. I wonder whether adding the change
>>> to the kernel (hm, actually the same would apply to xenial via the hwe path)
>>> would cause regressions when not using the cloud-archive...
>>
>> Yes, but applying this patch to the Trusty kernel is completely
>> unrelated to a Trusty-based Xen hypervisor.  This change is to the
>> guest kernel.  Meaning: whether or not we apply this patch to the
>> Trusty kernel, patched X/Y/Z kernels will still see problems if
>> they're running under a Trusty-based Xen hypervisor (i.e. qemu 2.0.0
>> or earlier), while an unpatched T kernel will see problems if it's
>> running under a newer (Vivid or later, or UCA Kilo or later) Xen
>> hypervisor.
>
> You are right. I just started to think about compat issues while looking at the
> Trusty kernel, wondering which qemu version would be around in that release.
> Put simply: patching any kernel will allow a guest VM to properly use
> pci-passthrough on any host/hypervisor that uses qemu-2.1 onwards, and breaks
> it on older systems. That means Precise (too old to worry much about) and
> Trusty-based Xen hosts (without the cloud-archive enabled). I do not have real
> numbers on how often people run Trusty Xen + pci-passthrough. Guess we will
> see.

Yep, exactly.

> Maybe the other unknown is what versions of qemu run on which instance type
> hosts on AWS. Hopefully recent enough ones on those that offer pass-through...

I'm checking on that now, but since AWS doesn't publicly discuss its
hypervisor implementation, it's unlikely we'll find out and/or be able
to discuss it here.  I have let AWS know that they will need to patch
any of their hypervisors that still use qemu 2.0.0 or earlier (if they
have any with a qemu that old), though they may well have patched them
already.

>
> But overall no reason to not change the Trusty kernel, too.
>
> -Stefan
>>
>> Applying this to T as well as X/Y/Z means all of them will work right
>> under a qemu 2.1.0 or later Xen hypervisor, and none of them will work
>> right under a qemu 2.0.0 or earlier Xen hypervisor.
>>
>>> But then, changing the 2.0.0 qemu to add this unmapping might cause other
>>> unexpected problems...
>>
>> I opened bug 1657489 to track fixing Trusty qemu.
>> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1657489
>>
>> I included a matrix there of hypervisor version and guest kernel
>> version, which I'll include here.
>>
>> 1) qemu not patched (2.0.0 and earlier), guest kernel not patched:
>> CORRECT behavior
>> hypervisor: Trusty qemu or UCA Icehouse qemu
>> guest: all without patch from bug 1656381
>> failure: none
>>
>> 2) qemu not patched (2.0.0 and earlier), guest kernel patched:
>> INCORRECT behavior
>> hypervisor: Trusty qemu or UCA Icehouse qemu
>> guest: all with patch from bug 1656381
>> failure: MSI interrupts will fail to be configured for any device that
>> disables and then re-enables its MSI; only a device's first MSI enable
>> will work. For example, unloading a driver will result in a failure to
>> enable MSI when the driver is reloaded.
>>
>> 3) qemu patched (2.1.0 and later), guest kernel not patched: INCORRECT behavior
>> hypervisor: Vivid or later qemu, or UCA Kilo or later qemu
>> guest: all without patch from bug 1656381
>> failure: MSI interrupts in the guest may not be correctly mapped if
>> device B enables its MSI after device A has disabled its MSI; when
>> device A re-enables its MSI, some of its interrupts will fail to be
>> configured correctly. NVMe shows this repeatedly with multiple NVMe
>> controllers; usually only 1 NVMe controller will finish initialization
>> correctly.
>>
>> 4) qemu patched (2.1.0 and later), guest kernel patched: CORRECT behavior
>> hypervisor: Vivid or later qemu, or UCA Kilo or later qemu
>> guest: all with patch from bug 1656381
>> failure: none
>>
>>
>>>
>>> -Stefan
>>>
>>>>
>>>> This bug is causing failures to initialize multiple NVMe controllers
>>>> under Xen, because the NVMe driver sets up a single MSIX vector for
>>>> each controller (concurrently), and then after using that to talk to
>>>> the controller for some configuration data, it disables the single MSIX
>>>> vector and re-configures all the MSIX vectors it needs.  So the MSIX
>>>> setup code tries to re-use the cached pirq from the first vector
>>>> for each controller, but the hypervisor has already given away that
>>>> pirq to another controller, and its initialization fails.
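
(To make that failure sequence concrete, here's a minimal sketch of the
enable/disable/re-enable pattern described above, using the Trusty-era
pci_enable_msix()/pci_disable_msix() API.  The probe function and vector
counts are illustrative, not the actual NVMe driver code:)

#include <linux/pci.h>

/* Models the probe flow that trips the old cached-pirq path: the second
 * pci_enable_msix() re-used the pirq cached in the MSI message data by
 * the first enable, but a qemu >= 2.1.0 hypervisor already unmapped (and
 * possibly handed to another device) that pirq during step 2.
 */
static int example_probe(struct pci_dev *pdev)
{
        struct msix_entry entries[8];
        int i, ret;

        /* Step 1: enable a single vector to talk to the controller. */
        entries[0].entry = 0;
        ret = pci_enable_msix(pdev, entries, 1);
        if (ret)
                return ret;

        /* ... use that vector to read the controller's configuration ... */

        /* Step 2: tear the single vector down again. */
        pci_disable_msix(pdev);

        /* Step 3: re-enable with the full vector count the device needs. */
        for (i = 0; i < 8; i++)
                entries[i].entry = i;
        return pci_enable_msix(pdev, entries, 8);
}
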
>>>>
>>>> This is discussed in more detail at:
>>>> https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
>>>>
>>>> Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
>>>> Signed-off-by: Dan Streetman <dan.streetman at canonical.com>
>>>> (backported from X/Y/Z patch for context/function name changes only)
>>>>
>>>> ---
>>>>  arch/x86/pci/xen.c | 23 +++++++----------------
>>>>  1 file changed, 7 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>>>> index 5eee495..6969c76 100644
>>>> --- a/arch/x86/pci/xen.c
>>>> +++ b/arch/x86/pci/xen.c
>>>> @@ -227,23 +227,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>>>>               return 1;
>>>>
>>>>       list_for_each_entry(msidesc, &dev->msi_list, list) {
>>>> -             __read_msi_msg(msidesc, &msg);
>>>> -             pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>>>> -                     ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>>>> -             if (msg.data != XEN_PIRQ_MSI_DATA ||
>>>> -                 xen_irq_from_pirq(pirq) < 0) {
>>>> -                     pirq = xen_allocate_pirq_msi(dev, msidesc);
>>>> -                     if (pirq < 0) {
>>>> -                             irq = -ENODEV;
>>>> -                             goto error;
>>>> -                     }
>>>> -                     xen_msi_compose_msg(dev, pirq, &msg);
>>>> -                     __write_msi_msg(msidesc, &msg);
>>>> -                     dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>>> -             } else {
>>>> -                     dev_dbg(&dev->dev,
>>>> -                             "xen: msi already bound to pirq=%d\n", pirq);
>>>> +             pirq = xen_allocate_pirq_msi(dev, msidesc);
>>>> +             if (pirq < 0) {
>>>> +                     irq = -ENODEV;
>>>> +                     goto error;
>>>>               }
>>>> +             xen_msi_compose_msg(dev, pirq, &msg);
>>>> +             __write_msi_msg(msidesc, &msg);
>>>> +             dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>>>               irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>>>>                                              (type == PCI_CAP_ID_MSIX) ?
>>>>                                              "msi-x" : "msi",
>>>>
>>>
>
>
