APPLIED: [SRU][F][PATCH V2 0/6] nbd: requests can become stuck when disconnecting from server with qemu-nbd
Stefan Bader
stefan.bader at canonical.com
Mon Aug 8 09:52:35 UTC 2022
On 20.07.22 05:12, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896350
>
> [Impact]
>
> After 2516ab1("nbd: only clear the queue on device teardown"), present in
> 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently
> being processed. This change was made to fix a race between using the
> NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing
> requests. This worked for the most part, as several years ago systemd was
> not set up to watch nbd devices for changes in their state.
>
> But after:
>
> commit f82abfcda58168d9f667e2094d438763531d3fa6
> From: Tony Asleson <tasleson at redhat.com>
> Date: Fri, 8 Feb 2019 15:47:10 -0600
> Subject: rules: watch metadata changes on nbd devices
> Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6
>
> in systemd v242-rc1, nbd* devices were added to a udev rule to watch those
> devices for changes with the inotify subsystem. From man udev:
>
>> watch
>> Watch the device node with inotify; when the node is closed after being
>> opened for writing, a change uevent is synthesized.
>>
>> nowatch
>> Disable the watching of a device node with inotify.
>
> This changed the behaviour of device teardown, since systemd now keeps tabs on
> the device with inotify, outstanding requests cannot be cleared as
> nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get
> stuck, never to complete, because a disconnect has occurred, and never to
> timeout, as their timers keep being reset.
>
> Symptoms of this issue is that the nbd subsystem gets stuck with messages like:
>
> block nbd15: NBD_DISCONNECT
> block nbd15: Send disconnect failed -32
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read at 523915264,24576B). Runtime 30 seconds
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read at 523915264,24576B). Runtime 150 seconds
> ...
> INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
> Not tainted 5.15.0-23-generic #23-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:qemu-nbd state:D stack: 0 pid: 1267 ppid: 1 flags:0x00000002
> Call Trace:
> <TASK>
> __schedule+0x23d/0x590
> ? call_rcu+0xe/0x10
> schedule+0x4e/0xb0
> blk_mq_freeze_queue_wait+0x69/0xa0
> ? wait_woken+0x70/0x70
> blk_mq_freeze_queue+0x1b/0x30
> nbd_add_socket+0x76/0x1f0 [nbd]
> __nbd_ioctl+0x18b/0x340 [nbd]
> ? security_capable+0x3d/0x60
> nbd_ioctl+0x81/0xb0 [nbd]
> blkdev_ioctl+0x12e/0x270
> ? __fget_files+0x86/0xc0
> block_ioctl+0x46/0x50
> __x64_sys_ioctl+0x91/0xc0
> do_syscall_64+0x5c/0xc0
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> </TASK>
>
> Additionally, in syslog you will also see systemd-udevd get stuck:
>
> systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time
>
> $ ps aux
> ...
> 419 1194 root D 0.1 systemd-udevd -
>
> We can workaround the issue by adding a higher priority udev rule to not watch
> nbd* devices.
>
> $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
> # Disable inotify watching of change events for NBD devices
> ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
> EOF
>
> $ sudo udevadm control --reload-rules
> $ sudo udevadm trigger
>
> [Fix]
>
> The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which
> was introduced in 5.16, and added to in 5.19. We need to backport all commits
> related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.
>
> For Focal, Impish and Jammy:
>
> commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date: Thu Sep 16 17:33:44 2021 +0800
> Subject: nbd: don't handle response without a corresponding request message
> Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
>
> commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date: Thu Sep 16 17:33:45 2021 +0800
> Subject: nbd: make sure request completion won't concurrent
> Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9
>
> commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date: Sat May 21 15:37:46 2022 +0800
> Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e
>
> commit 09dadb5985023e27d4740ebd17e6fea4640110e5
> Author: Yu Kuai <yukuai3 at huawei.com>
> Date: Sat May 21 15:37:47 2022 +0800
> Subject: nbd: fix io hung while disconnecting device
> Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5
>
> For Focal only (dependency commits):
>
> commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
> Author: Keith Busch <kbusch at kernel.org>
> Date: Fri May 29 07:51:59 2020 -0700
> Subject: blk-mq: blk-mq: provide forced completion method
> Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d
>
> commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> Author: Christoph Hellwig <hch at lst.de>
> Date: Thu Jun 11 08:44:47 2020 +0200
> Subject: blk-mq: move failure injection out of blk_mq_complete_request
> Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
>
> I want to talk about the backport of "blk-mq: move failure injection out of
> blk_mq_complete_request" for Focal, since it changes a number of drivers using
> the blk_mq_complete_request() call, and can be considered a large regression
> risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be
> enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
> CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus
> blk_should_fake_timeout() just returns false, and is more or less a nop on our
> kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is
> really just if(true), and I did think about simply backporting that to the
> nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
> but after talking with Jay, we decided backporting the entire patch was the best
> way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a
> custom kernel build with the nbd subsystem.
>
> Note: Bionic is also affected, but the backport list to the 4.15 kernel is
> unreasonable, and due to systemd in Bionic not having the udev rule set to watch
> nbd* devices by default, Bionic is not directly affected by the issue. Users
> would only see the problem if they add nbd* to the watch list in
> /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or
> later LXC container running using the hosts nbd kernel module. Because of this
> Bionic will be Won't Fix.
>
> [Testcase]
>
> The issue can be easily reproduced with:
>
> $ sudo apt install qemu-utils
>
> $ cat << EOF >> reproducer.sh
> #!/bin/bash
>
> sudo modprobe nbd
>
> while :
> do
> qemu-img create -f qcow2 foo.img 500M
> sudo qemu-nbd --disconnect /dev/nbd15 || true
> sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
> sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
> sudo qemu-nbd --disconnect /dev/nbd15
> done
> EOF
>
> $ chmod +x reproducer.sh
> $ yes | ./reproducer.sh
>
> On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or
> so, and the kernel log will be filled with stuck request messages and hung
> task timeouts after the 120 second mark.
>
> Test kernels are available in the following ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test
>
> If you install these test kernels, and re-run the reproducer, your system will
> be perfectly stable for many hours with no issues.
>
> I have torture tested the backport to the Focal kernel for more than 4 hours
> with no issues observed.
>
> [Where problems can occur]
>
> Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change
> to the NBD subsystem is quite large, and changes how requests are processed
> and accounted for, depending if they are still outstanding or not.
>
> For Impish and Jammy, these risks are limited to the NBD subsystem itself.
>
> On Focal, the risk of regression is slightly larger due to the backport of
> "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat
> mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
> being disabled on the Focal kernel.
>
> If a regression were to occur, users may see NBD requests fail, occur in the
> incorrect order, or cleared at incorrect times. There is no way to disable
> NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older
> kernel while a fix is made.
>
> [Other info]
>
> My bug report to upstream can be found in the following mailing list thread:
> https://lkml.org/lkml/2022/4/22/61
>
> You may be wondering why we cannot simply just backport "nbd: fix io hung while
> disconnecting device" and resolve the issue with a 1 line change. I actually
> built test kernels for this scenario to see if it was possible, and they are
> available here:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single
>
> While it did sort of fix the issue, that is, connect and disconnect from nbd
> devices lasted longer than the 30 seconds the kernels lasted previously, after
> about 10 minutes of torture testing, I started experiencing race conditions
> of requests completing multiple times, leading to the following use after frees:
>
> Jammy 5.15 test kernel:
> https://paste.ubuntu.com/p/FSM6DrgjTy/
>
> Impish 5.13 test kernel:
> https://paste.ubuntu.com/p/86tNfsM7Vs/
>
> Focal 5.4 test kernel:
> https://paste.ubuntu.com/p/zzVzWx23sb/
>
> This was similar to a mailing list thread I found previously:
> https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
> https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/
>
> Keith Busch identified this as a double completion of the same request,
> which is resolved through the NBD_CMD_INFLIGHT infrastructure.
>
> Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT
> infrastructure for a complete fix.
>
> Christoph Hellwig (1):
> blk-mq: move failure injection out of blk_mq_complete_request
>
> Keith Busch (1):
> blk-mq: blk-mq: provide forced completion method
>
> Yu Kuai (4):
> nbd: don't handle response without a corresponding request message
> nbd: make sure request completion won't concurrent
> nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> nbd: fix io hung while disconnecting device
>
> block/blk-mq.c | 27 ++++++------------
> block/blk-timeout.c | 6 ++--
> block/blk.h | 8 ------
> block/bsg-lib.c | 5 +++-
> drivers/block/loop.c | 6 ++--
> drivers/block/mtip32xx/mtip32xx.c | 3 +-
> drivers/block/nbd.c | 46 +++++++++++++++++++++++++++++--
> drivers/block/null_blk_main.c | 3 +-
> drivers/block/skd_main.c | 9 ++++--
> drivers/block/virtio_blk.c | 3 +-
> drivers/block/xen-blkfront.c | 3 +-
> drivers/md/dm-rq.c | 3 +-
> drivers/mmc/core/block.c | 6 ++--
> drivers/nvme/host/nvme.h | 3 +-
> drivers/s390/block/dasd.c | 2 +-
> drivers/s390/block/scm_blk.c | 3 +-
> drivers/scsi/scsi_lib.c | 12 ++------
> include/linux/blk-mq.h | 11 +++++++-
> 18 files changed, 99 insertions(+), 60 deletions(-)
>
Applied to focal:linux/master-next. Thanks.
-Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.ubuntu.com/archives/kernel-team/attachments/20220808/e3fa8044/attachment-0001.sig>
More information about the kernel-team
mailing list