ACK: [SRU][Noble][PATCH 0/1] md: nvme over tcp with a striped underlying md raid device leads to data corruption
Roxana Nicolescu
roxana.nicolescu at canonical.com
Tue Jul 30 09:11:04 UTC 2024
On 30/07/2024 06:38, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2075110
>
> [Impact]
>
> There is a fault in the md subsystem where __write_sb_page() rounds the io
> size up to the optimal io size, but does not check whether the final io size
> exceeds the bitmap length.
>
> This gets us into a situation where, with a 4K page size and a 256K optimal
> io size, a bitmap write gets rounded up to 256K and needs 64 pages, but
> md_bitmap_storage_alloc() only allocated 1 page. The io then also submits the
> 63 pages that happen to sit in memory after the bitmap page.
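>
> To illustrate the arithmetic, here is a minimal userspace sketch (not the
> kernel code; round_up_io() and the 4K page size / 256K optimal io size are
> just the example values from above):
>
>   #include <stdio.h>
>
>   #define PAGE_SIZE   4096u
>   #define OPT_IOSIZE  (256u * 1024)    /* device's optimal io size */
>
>   /* round a write up to the optimal io size, as __write_sb_page() does */
>   static unsigned int round_up_io(unsigned int size, unsigned int opt)
>   {
>       return ((size + opt - 1) / opt) * opt;
>   }
>
>   int main(void)
>   {
>       unsigned int allocated_pages = 1;    /* md_bitmap_storage_alloc() */
>       unsigned int io = round_up_io(PAGE_SIZE, OPT_IOSIZE);
>
>       printf("io size %uK covers %u pages, but only %u page is allocated\n",
>              io / 1024, io / PAGE_SIZE, allocated_pages);
>       return 0;
>   }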
>
> When we send md writes over the network, e.g. with nvme over tcp, the network
> subsystem checks that the first page passes sendpage_ok(), but not the other
> 63, which might not be sendpage_ok(). The send then gets stuck, causing a
> hang and data corruption.
>
> If you trigger the issue, you get the following warning and call trace in dmesg:
>
> WARNING: CPU: 0 PID: 83 at net/core/skbuff.c:6995 skb_splice_from_iter+0x139/0x370
> CPU: 0 PID: 83 Comm: kworker/0:1H Not tainted 6.8.0-39-generic #39-Ubuntu
> Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
> RIP: 0010:skb_splice_from_iter+0x139/0x370
> CR2: 000072dab83e5f84
> Call Trace:
> <TASK>
> ? show_regs+0x6d/0x80
> ? __warn+0x89/0x160
> ? skb_splice_from_iter+0x139/0x370
> ? report_bug+0x17e/0x1b0
> ? handle_bug+0x51/0xa0
> ? exc_invalid_op+0x18/0x80
> ? asm_exc_invalid_op+0x1b/0x20
> ? skb_splice_from_iter+0x139/0x370
> tcp_sendmsg_locked+0x352/0xd70
> ? tcp_push+0x159/0x190
> ? tcp_sendmsg_locked+0x9c4/0xd70
> tcp_sendmsg+0x2c/0x50
> inet_sendmsg+0x42/0x80
> sock_sendmsg+0x118/0x150
> nvme_tcp_try_send_data+0x18b/0x4c0 [nvme_tcp]
> ? __tcp_cleanup_rbuf+0xc5/0xe0
> nvme_tcp_try_send+0x23c/0x300 [nvme_tcp]
> nvme_tcp_io_work+0x40/0xe0 [nvme_tcp]
> process_one_work+0x16c/0x350
> worker_thread+0x306/0x440
> ? _raw_spin_unlock_irqrestore+0x11/0x60
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xef/0x120
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x44/0x70
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
> nvme nvme1: failed to send request -5
> nvme nvme1: I/O tag 125 (307d) type 4 opcode 0x0 (I/O Cmd) QID 1 timeout
> nvme nvme1: starting error recovery
> block nvme1n1: no usable path - requeuing I/O
> nvme nvme1: Reconnecting in 10 seconds...
>
> There is no workaround.
>
> [Fix]
>
> This was fixed in the below commit in 6.11-rc1:
>
> commit ab99a87542f194f28e2364a42afbf9fb48b1c724
> Author: Ofir Gal <ofir.gal at volumez.com>
> Date: Fri Jun 7 10:27:44 2024 +0300
> Subject: md/md-bitmap: fix writing non bitmap pages
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab99a87542f194f28e2364a42afbf9fb48b1c724
>
> This is a clean cherry-pick to the Noble tree.
>
> [Testcase]
>
> This can be reproduced by running blktests md/001 [1], which the author of the
> fix created to act as a regression test for this issue.
>
> [1] https://github.com/osandov/blktests/commit/a24a7b462816fbad7dc6c175e53fcc764ad0a822
>
> Deploy a fresh Noble VM that has a scratch NVMe disk.
>
> $ sudo apt install build-essential fio
> $ git clone https://github.com/osandov/blktests.git
> $ cd blktests
> $ make
> $ echo "TEST_DEVS=(/dev/nvme0n1)" > config
> $ sudo ./check md/001
>
> The md/001 test will hang an affected system, and the above oops message will
> be visible in dmesg.
>
> A test kernel is available in the following ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf390669-test
>
> If you install the test kernel, the md/001 test will complete successfully, and
> the issue will no longer appear.
>
> [Where problems could occur]
>
> We are changing how the md subsystem calculates the final IO size, taking the
> smaller of the rounded-up size and the bitmap limit. This makes sure we don't
> leak memory lying past the bitmap pages into the write and corrupt data.
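>
> Roughly, the clamp looks like this (illustrative userspace sketch, not the
> verbatim upstream diff; file_pages and pg_index mirror the names used in
> drivers/md/md-bitmap.c):
>
>   #include <stdio.h>
>
>   #define PAGE_SIZE 4096u
>
>   /* clamp the rounded-up io size to the pages the bitmap actually owns */
>   static unsigned int clamp_to_bitmap(unsigned int io_size,
>                                       unsigned int file_pages,
>                                       unsigned int pg_index)
>   {
>       unsigned int bitmap_limit = (file_pages - pg_index) * PAGE_SIZE;
>
>       return io_size < bitmap_limit ? io_size : bitmap_limit;
>   }
>
>   int main(void)
>   {
>       /* a 256K rounded-up write while the bitmap owns a single page */
>       printf("%u\n", clamp_to_bitmap(256u * 1024, 1, 0));    /* prints 4096 */
>       return 0;
>   }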
>
> If a regression were to occur, it would likely affect all md users, but would
> be more obvious to md users over the network, like nvme over tcp.
>
> There is no workaround. Users would have to downgrade their kernels if a
> regression occurs.
>
> [Other info]
>
> I checked Jammy's 5.15 kernel and it is not affected, so the issue must have
> been introduced later on. The fix is not needed for Focal or Jammy.
>
> Ofir Gal (1):
> md/md-bitmap: fix writing non bitmap pages
>
> drivers/md/md-bitmap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
Acked-by: Roxana Nicolescu <roxana.nicolescu at canonical.com>