NACK: [SRU X][PATCH 0/3] Add kernel parameter 'pci=clearmsi' to clear MSI(X)s early on boot

Thadeu Lima de Souza Cascardo cascardo at canonical.com
Thu Nov 8 11:02:06 UTC 2018


On Thu, Nov 08, 2018 at 10:11:54AM +0000, Andy Whitcroft wrote:
> On Wed, Nov 07, 2018 at 11:12:17PM -0200, Mauricio Faria de Oliveira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1797990
> > 
> > (note: the patch set for later releases will be sent soon)
> > 
> > [Impact]
> > 
> >  * A kexec/crash kernel might get stuck and fail to boot
> >    (for crash kernel, kdump fails to collect a crashdump)
> >    if a PCI device is buggy/stuck/looping and triggers a
> >    continuous flood of MSI(X) interrupts (that the kernel
> >    does not yet know about).
> > 
> >  * This fix allowed to obtain crashdumps when debugging a
> >    heavy-load scenario, in which a (heavy-loaded) network
> >    adapter wouldn't stop triggering MSI-X interrupts ever
> >    after panic()->kdump kicked in.
> > 
> >  * This fix disables MSI(X) in all PCI devices on early
> >    boot (this is OK as it's (re-)enabled normally later)
> >    with a kernel cmdline parameter (disabled by default).
> > 
> > [Test Case]
> > 
> >  * A synthetic test-case is not yet available, however,
> >    this particular system/workload triggered the problem
> >    consistently, and it was used for development/testing.
> > 
> >  * We'll update this bug once a synthetic test-case is
> >    available; we're working on patching QEMU for this.
> > 
> >  * $ dmesg | grep 'Clearing MSI'
> >    [    0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
> > 
> >  * The comparison of 'dmesg -t | sort' has been reviewed
> >    between option disabled/enabled on boot & kexec modes,
> >    and only expected differences found (MHz, PIDs, MIPS).
> > 
> > [Regression Potential] 
> > 
> >  * The potential area for regressions is early boot,
> >    particularly effects of applying quirks during PCI
> >    bus scan, which is changed/broader w/ these patches.
> > 
> >  * However, all quirks are applied based on PCI ID
> >    matching, so would only apply if actually targeting
> >    a new device.
> > 
> >  * Moreover, the new quirk is only applied based on
> >    a kernel cmdline parameter that is disabled by
> >    default, which constraints even more when this
> >    is actually in effect.
> > 
> > [Other Info]
> >  
> >  * The patch series is still under review/discussion
> >    upstream, but it's relatively important for Ubuntu
> >    users at this point, and after internal discussions
> >    we decided to submit it for SRU.
> > 
> >  * These are links to the linux-pci archive with the
> >    patches [1, 2, 3]
> > 
> >    [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
> >        https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
> > 
> >    [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
> >        https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
> > 
> >    [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
> >        https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
> > 
> > Guilherme G. Piccoli (3):
> >   UBUNTU: SAUCE: x86/quirks: Scan all busses for early PCI quirks
> >   UBUNTU: SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
> >   UBUNTU: SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
> > 
> >  Documentation/kernel-parameters.txt |  6 ++++
> >  arch/x86/include/asm/pci-direct.h   |  2 ++
> >  arch/x86/kernel/aperture_64.c       | 30 ++-----------------
> >  arch/x86/kernel/early-quirks.c      | 45 ++++++++++++++++++++++++-----
> >  arch/x86/pci/common.c               |  4 +++
> >  arch/x86/pci/early.c                | 25 ++++++++++++++++
> >  6 files changed, 77 insertions(+), 35 deletions(-)
> > 
> 
> As the new functionality is off by default this should be a low risk.
> The only obvious remaining risk is that we scan all of the pci busses
> where we did not necessarily do so before.  And as the patch itself
> mentions this might have been done to avoid bugs in Nvidia chipsets.
> Likely some nvidia based and specific testing is warrented before this
> cycle releases.
> 
> Acked-by: Andy Whitcroft <apw at canonical.com>
> 
> -apw

This is a NACK from me. If we are gonna skip upstream review and
extensive system testing before applying, I would like to see the bus
scan change restricted to the command line option.

That is, please resubmit, but only scan all busses when the command line
option is given. If the command line option is not give, the kernel
behavior should be the same as before.

Cascardo.



More information about the kernel-team mailing list