APPLIED/cmt: [SRU X] [PATCH 0/1] NVMe polling on timeout
Khaled Elmously
khalid.elmously at canonical.com
Wed Jan 9 09:28:07 UTC 2019
Changed
(backported from 7776db1ccc123d5944a8c170c9c45f7e91d49643 upstream)
to
(backported from commit 7776db1ccc123d5944a8c170c9c45f7e91d49643)
which is how we normally do it.
On 2018-12-07 19:28:10 , Guilherme G. Piccoli wrote:
> BugLink: https://launchpad.net/bugs/1807393
>
>
> [Impact]
>
> * NVMe controllers can occasionally fail to send an interrupt, especially
> due to bugs in virtual devices (which are common these days: qemu has its
> own NVMe virtual device, and so does AWS). This is a difficult situation
> to debug, because the NVMe driver only reports the request timeout,
> not the reason.
>
> * The upstream patch proposed for SRU here was designed to provide more
> information in these cases by proactively polling the CQEs on request
> timeouts, to check whether the specific request was completed and some issue
> (probably a missed interrupt) prevented the driver from noticing, or whether
> the request really wasn't completed, which indicates more severe issues.
>
> * Although primarily useful for debugging, this patch could also help mitigate
> issues in cloud environments like AWS, where there may be jitter in
> request completion and the I/O timeout is set to a low value, or
> in case of atypical bugs in the virtual NVMe controller. With this patch,
> if polling succeeds the NVMe driver continues working instead of
> attempting a controller reset, which may lead to failures in the
> rootfs - refer to https://launchpad.net/bugs/1788035.
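
For readers skimming the archive, the behaviour described in the [Impact]
bullets above can be modelled in user space. This is only a toy sketch, not
the code of the patch: ToyQueue, TimeoutAction and every method name are
hypothetical, and the real driver works on hardware completion queues in
drivers/nvme/host/pci.c rather than on Python lists.

```python
from dataclasses import dataclass, field
from enum import Enum


class TimeoutAction(Enum):
    COMPLETED_BY_POLL = "completed by poll"  # request was done; IRQ likely missed
    RESET_CONTROLLER = "reset controller"    # request really was not completed


@dataclass
class ToyQueue:
    """Hypothetical stand-in for an NVMe queue pair (not a kernel structure)."""
    cq: list = field(default_factory=list)      # completion entries (command ids)
    inflight: set = field(default_factory=set)  # command ids still outstanding

    def submit(self, cid):
        self.inflight.add(cid)

    def device_complete(self, cid, raise_irq=True):
        """Device posts a completion entry and may fail to raise the interrupt."""
        self.cq.append(cid)
        if raise_irq:
            self.irq_handler()

    def irq_handler(self):
        """Normal path: drain the CQ and finish the matching requests."""
        while self.cq:
            self.inflight.discard(self.cq.pop(0))

    def poll_for(self, cid):
        """Drain the CQ by polling; report whether cid was among the entries."""
        found = cid in self.cq
        self.irq_handler()
        return found

    def on_timeout(self, cid):
        """Timeout path: poll first, fall back to a reset only if nothing is found."""
        if self.poll_for(cid):
            return TimeoutAction.COMPLETED_BY_POLL
        return TimeoutAction.RESET_CONTROLLER


if __name__ == "__main__":
    q = ToyQueue()
    q.submit(1)
    q.device_complete(1, raise_irq=False)  # completion posted, interrupt lost
    print(q.on_timeout(1).value)           # prints "completed by poll"
```

In this model a request whose completion entry is already in the queue is
finished by the timeout handler itself, and only a request with no entry at
all falls through to the reset path.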
>
>
> [Test Case]
>
> * It's a bit tricky to artificially create a missed-interrupt
> situation; one idea was to implement a small hack in the qemu NVMe
> virtual device that, given a trigger from the guest kernel, induces the
> virtual device to skip an interrupt. The hack patch is available in a
> Launchpad comment, along with instructions to reproduce.
>
>
> [Regression Potential]
>
> * There are no clear risks in adding such a polling mechanism to the NVMe driver;
> one bad thing that was never reported but could happen with this patch is that
> the device could be in a bad state IRQ-wise that a reset would fix, but
> the patch could cause all requests to be completed via polling, which
> prevents the adapter reset. This is, however, a very hypothetical situation,
> which would also happen in the mainline kernel (since it has the patch).
>
>
> Keith Busch (1):
> nvme/pci: Poll CQ on timeout
>
> drivers/nvme/host/pci.c | 21 ++++++++++++++++++---
> 1 file changed, 18 insertions(+), 3 deletions(-)
>
> --
> 2.19.2
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team