APPLIED/cmt: [SRU X] [PATCH 0/1] NVMe polling on timeout

Khaled Elmously khalid.elmously at canonical.com
Wed Jan 9 09:28:07 UTC 2019


Changed
    (backported from 7776db1ccc123d5944a8c170c9c45f7e91d49643 upstream)
to
    (backported from commit 7776db1ccc123d5944a8c170c9c45f7e91d49643)

...which is how we normally do it.


On 2018-12-07 19:28:10 , Guilherme G. Piccoli wrote:
> BugLink: https://launchpad.net/bugs/1807393
> 
> 
> [Impact]
> 
> * NVMe controllers can occasionally fail to send an interrupt, especially
> due to bugs in virtual devices (which are common these days: qemu has its
> own virtual NVMe device, and so does AWS). This is a difficult situation
> to debug, because the NVMe driver reports only the request timeout,
> not the reason.
> 
> * The upstream patch proposed for SRU here was designed to provide more
> information in these cases by proactively polling the CQEs on request
> timeouts, to check whether the specific request was completed and some
> issue (probably a missed interrupt) prevented the driver from noticing,
> or whether the request really wasn't completed, which indicates a more
> severe issue.
> 
> * Besides being quite useful for debugging, this patch could help mitigate
> issues in cloud environments like AWS, where there may be jitter in
> request completion and the I/O timeout was set to a low value, or even
> in case of atypical bugs in the virtual NVMe controller. With this patch,
> if polling succeeds the NVMe driver will continue working instead of
> attempting a controller reset, which may lead to failures in the
> rootfs - refer to https://launchpad.net/bugs/1788035.
> 
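
For readers unfamiliar with the mechanism, the decision described in
[Impact] can be modelled in plain userspace C roughly as below. This is
only an illustrative sketch, not the upstream patch and not the driver's
real data structures: every name here (model_cqe, model_cq,
model_poll_cq, model_handle_timeout, the enum values) is hypothetical,
and the queue is reduced to a flat array with a "valid" flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one completion queue entry (CQE). */
struct model_cqe {
    uint16_t command_id; /* which request this entry completes */
    bool     valid;      /* true once the device has posted the entry */
};

/* Hypothetical model of the not-yet-consumed part of a completion queue. */
struct model_cq {
    const struct model_cqe *entries;
    size_t depth;
};

enum timeout_action {
    TIMEOUT_COMPLETED_BY_POLL, /* request was done; interrupt likely missed */
    TIMEOUT_NEEDS_RECOVERY     /* request still outstanding; recovery path  */
};

/* Scan the queue for a posted entry matching the timed-out request. */
static bool model_poll_cq(const struct model_cq *cq, uint16_t command_id)
{
    for (size_t i = 0; i < cq->depth; i++) {
        if (cq->entries[i].valid && cq->entries[i].command_id == command_id)
            return true;
    }
    return false;
}

/* On request timeout: poll first, and only fall back to recovery
 * (abort/reset in the real driver) if the request is truly incomplete. */
static enum timeout_action model_handle_timeout(const struct model_cq *cq,
                                                uint16_t command_id)
{
    if (model_poll_cq(cq, command_id))
        return TIMEOUT_COMPLETED_BY_POLL;
    return TIMEOUT_NEEDS_RECOVERY;
}
```

In this model, a timed-out request whose entry is already posted yields
TIMEOUT_COMPLETED_BY_POLL (the "missed interrupt" case), while a request
with no posted entry yields TIMEOUT_NEEDS_RECOVERY.
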
> 
> [Test Case]
> 
> * It's a bit tricky to artificially create a missed-interrupt
> situation; one idea was to implement a small hack in the qemu NVMe
> virtual device that, given a trigger in the guest kernel, induces the
> virtual device to skip an interrupt. The hack patch is present in a
> Launchpad comment, along with instructions to reproduce.
> 
> 
> [Regression Potential]
> 
> * There are no clear risks in adding such a polling mechanism to the NVMe
> driver; one bad outcome that was never reported but could happen with this
> patch is that the device is in a bad state IRQ-wise that a reset would fix,
> but the patch causes all requests to be completed via polling, which
> prevents the adapter reset. This is, however, a very hypothetical situation,
> which would also happen in the mainline kernel (since it has the patch).
> 
> 
> Keith Busch (1):
>   nvme/pci: Poll CQ on timeout
> 
>  drivers/nvme/host/pci.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> -- 
> 2.19.2
> 
> 
> -- 
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team


