ACK/Cmnt: [SRU][F][I][J][PATCH 0/6] nbd: requests can become stuck when disconnecting from server with qemu-nbd

Wed Jun 22 12:34:50 UTC 2022

On 6/21/22 23:17, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896350
> 
> [Impact]
> 
> After 2516ab1("nbd: only clear the queue on device teardown"), present in
> 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently
> being processed. This change was made to fix a race between using the
> NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing
> requests. This worked for the most part, as several years ago systemd was
> not set up to watch nbd devices for changes in their state.
> 
> But after:
> 
> commit f82abfcda58168d9f667e2094d438763531d3fa6
> From: Tony Asleson <tasleson at redhat.com>
> Date: Fri, 8 Feb 2019 15:47:10 -0600
> Subject: rules: watch metadata changes on nbd devices
> Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6
> 
> in systemd v242-rc1, nbd* devices were added to a udev rule to watch those
> devices for changes with the inotify subsystem. From man udev:
> 
>> watch
>>    Watch the device node with inotify; when the node is closed after being
>>    opened for writing, a change uevent is synthesized.
>>
>> nowatch
>>    Disable the watching of a device node with inotify.
> 
> This changed the behaviour of device teardown, since systemd now keeps tabs on
> the device with inotify, outstanding requests cannot be cleared as
> nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get
> stuck, never to complete, because a disconnect has occurred, and never to
> timeout, as their timers keep being reset.
> 
> Symptoms of this issue is that the nbd subsystem gets stuck with messages like:
> 
> block nbd15: NBD_DISCONNECT
> block nbd15: Send disconnect failed -32
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read at 523915264,24576B). Runtime 30 seconds
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read at 523915264,24576B). Runtime 150 seconds
> ...
> INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
>        Not tainted 5.15.0-23-generic #23-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:qemu-nbd        state:D stack:    0 pid: 1267 ppid:     1 flags:0x00000002
> Call Trace:
>   <TASK>
>   __schedule+0x23d/0x590
>   ? call_rcu+0xe/0x10
>   schedule+0x4e/0xb0
>   blk_mq_freeze_queue_wait+0x69/0xa0
>   ? wait_woken+0x70/0x70
>   blk_mq_freeze_queue+0x1b/0x30
>   nbd_add_socket+0x76/0x1f0 [nbd]
>   __nbd_ioctl+0x18b/0x340 [nbd]
>   ? security_capable+0x3d/0x60
>   nbd_ioctl+0x81/0xb0 [nbd]
>   blkdev_ioctl+0x12e/0x270
>   ? __fget_files+0x86/0xc0
>   block_ioctl+0x46/0x50
>   __x64_sys_ioctl+0x91/0xc0
>   do_syscall_64+0x5c/0xc0
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>   </TASK>
> 
> Additionally, in syslog you will also see systemd-udevd get stuck:
> 
> systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time
> 
> $ ps aux
> ...
> 419    1194 root     D     0.1 systemd-udevd   -
> 
> We can workaround the issue by adding a higher priority udev rule to not watch
> nbd* devices.
> 
> $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
> # Disable inotify watching of change events for NBD devices
> ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
> EOF
> 
> $ sudo udevadm control --reload-rules
> $ sudo udevadm trigger
> 
> [Fix]
> 
> The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which
> was introduced in 5.16, and added to in 5.19. We need to backport all commits
> related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.
> 
> For Focal, Impish and Jammy:
> 
> commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date:   Thu Sep 16 17:33:44 2021 +0800
> Subject: nbd: don't handle response without a corresponding request message
> Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> 
> commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date:   Thu Sep 16 17:33:45 2021 +0800
> Subject: nbd: make sure request completion won't concurrent
> Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9
> 
> commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date:   Sat May 21 15:37:46 2022 +0800
> Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> 
> commit 09dadb5985023e27d4740ebd17e6fea4640110e5
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date:   Sat May 21 15:37:47 2022 +0800
> Subject: nbd: fix io hung while disconnecting device
> Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5
> 
> For Focal only (dependency commits):
> 
> commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
> Author: Keith Busch <kbusch at kernel.org>
> Date:   Fri May 29 07:51:59 2020 -0700
> Subject: blk-mq: blk-mq: provide forced completion method
> Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d
> 
> commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> Author: Christoph Hellwig <hch at lst.de>
> Date:   Thu Jun 11 08:44:47 2020 +0200
> Subject: blk-mq: move failure injection out of blk_mq_complete_request
> Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> 
> I want to talk about the backport of "blk-mq: move failure injection out of
> blk_mq_complete_request" for Focal, since it changes a number of drivers using
> the blk_mq_complete_request() call, and can be considered a large regression
> risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be
> enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
> CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus
> blk_should_fake_timeout() just returns false, and is more or less a nop on our
> kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is
> really just if(true), and I did think about simply backporting that to the
> nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
> but after talking with Jay, we decided backporting the entire patch was the best
> way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a
> custom kernel build with the nbd subsystem.
> 
> Note: Bionic is also affected, but the backport list to the 4.15 kernel is
> unreasonable, and due to systemd in Bionic not having the udev rule set to watch
> nbd* devices by default, Bionic is not directly affected by the issue. Users
> would only see the problem if they add nbd* to the watch list in
> /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or
> later LXC container running using the hosts nbd kernel module. Because of this
> Bionic will be Won't Fix.
> 
> [Testcase]
> 
> The issue can be easily reproduced with:
> 
> $ sudo apt install qemu-utils
> 
> $ cat << EOF >> reproducer.sh
> #!/bin/bash
> 
> sudo modprobe nbd
> 
> while :
> do
>          qemu-img create -f qcow2 foo.img 500M
>          sudo qemu-nbd --disconnect /dev/nbd15 || true
>          sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
>          sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
>          sudo qemu-nbd --disconnect /dev/nbd15
> done
> EOF
> 
> $ chmod +x reproducer.sh
> $ yes | ./reproducer.sh
> 
> On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or
> so, and the kernel log will be filled with stuck request messages and hung
> task timeouts after the 120 second mark.
> 
> Test kernels are available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test
> 
> If you install these test kernels, and re-run the reproducer, your system will
> be perfectly stable for many hours with no issues.
> 
> I have torture tested the backport to the Focal kernel for more than 4 hours
> with no issues observed.
> 
> [Where problems can occur]
> 
> Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change
> to the NBD subsystem is quite large, and changes how requests are processed
> and accounted for, depending if they are still outstanding or not.
> 
> For Impish and Jammy, these risks are limited to the NBD subsystem itself.
> 
> On Focal, the risk of regression is slightly larger due to the backport of
> "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat
> mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
> being disabled on the Focal kernel.
> 
> If a regression were to occur, users may see NBD requests fail, occur in the
> incorrect order, or cleared at incorrect times. There is no way to disable
> NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older
> kernel while a fix is made.
> 
> [Other info]
> 
> My bug report to upstream can be found in the following mailing list thread:
> https://lkml.org/lkml/2022/4/22/61
> 
> You may be wondering why we cannot simply just backport "nbd: fix io hung while
> disconnecting device" and resolve the issue with a 1 line change. I actually
> built test kernels for this scenario to see if it was possible, and they are
> available here:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single
> 
> While it did sort of fix the issue, that is, connect and disconnect from nbd
> devices lasted longer than the 30 seconds the kernels lasted previously, after
> about 10 minutes of torture testing, I started experiencing race conditions
> of requests completing multiple times, leading to the following use after frees:
> 
> Jammy 5.15 test kernel:
> https://paste.ubuntu.com/p/FSM6DrgjTy/
> 
> Impish 5.13 test kernel:
> https://paste.ubuntu.com/p/86tNfsM7Vs/
> 
> Focal 5.4 test kernel:
> https://paste.ubuntu.com/p/zzVzWx23sb/
> 
> This was similar to a mailing list thread I found previously:
> https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
> https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/
> 
> Keith Busch identified this as a double completion of the same request,
> which is resolved through the NBD_CMD_INFLIGHT infrastructure.
> 
> Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT
> infrastructure for a complete fix.
> 
> Christoph Hellwig (1):
>    blk-mq: move failure injection out of blk_mq_complete_request
> 
> Keith Busch (1):
>    blk-mq: blk-mq: provide forced completion method
> 
> Yu Kuai (4):
>    nbd: don't handle response without a corresponding request message
>    nbd: make sure request completion won't concurrent
>    nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
>    nbd: fix io hung while disconnecting device
> 
>   block/blk-mq.c                    | 27 ++++++------------
>   block/blk-timeout.c               |  6 ++--
>   block/blk.h                       |  8 ------
>   block/bsg-lib.c                   |  5 +++-
>   drivers/block/loop.c              |  6 ++--
>   drivers/block/mtip32xx/mtip32xx.c |  3 +-
>   drivers/block/nbd.c               | 46 +++++++++++++++++++++++++++++--
>   drivers/block/null_blk_main.c     |  3 +-
>   drivers/block/skd_main.c          |  9 ++++--
>   drivers/block/virtio_blk.c        |  3 +-
>   drivers/block/xen-blkfront.c      |  3 +-
>   drivers/md/dm-rq.c                |  3 +-
>   drivers/mmc/core/block.c          |  6 ++--
>   drivers/nvme/host/nvme.h          |  3 +-
>   drivers/s390/block/dasd.c         |  2 +-
>   drivers/s390/block/scm_blk.c      |  3 +-
>   drivers/scsi/scsi_lib.c           | 12 ++------
>   include/linux/blk-mq.h            | 11 +++++++-
>   18 files changed, 99 insertions(+), 60 deletions(-)
> 
Acked-by: Tim Gardner <tim.gardner at canonical.com>

Nice work. I know this bug has been going on for awhile.

-- 
-----------
Tim Gardner
Canonical, Inc