[SRU][Noble][PATCH 0/1] md: nvme over tcp with a striped underlying md raid device leads to data corruption
Matthew Ruffell
matthew.ruffell at canonical.com
Tue Jul 30 04:38:20 UTC 2024
BugLink: https://bugs.launchpad.net/bugs/2075110
[Impact]
There is a fault in the md subsystem where __write_sb_page() rounds the io
size up to the device's optimal io size, but never checks whether the final io
size exceeds the bitmap length.
This gets us into a situation where, for example, a 256K io needs 64 pages,
but md_bitmap_storage_alloc() only allocated 1 page for the bitmap, so the io
also sweeps in the 63 pages that happen to be allocated after it.
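To make the arithmetic concrete, here is a small userspace sketch of the size
calculation, using the example values from the description above (4K pages,
256K optimal io size, 1 allocated bitmap page); the names and values are
illustrative only, not taken from the kernel source:
/* Sketch only: models the io size rounding described above. */
#include <stdio.h>

#define PAGE_SZ     4096u
#define OPT_IO_SIZE (256u * 1024u)   /* device's optimal io size */

static unsigned int roundup_u(unsigned int x, unsigned int to)
{
    return ((x + to - 1) / to) * to;
}

int main(void)
{
    unsigned int allocated_pages = 1;    /* what md_bitmap_storage_alloc() gave us */
    unsigned int size = PAGE_SZ;         /* one bitmap page to write */

    size = roundup_u(size, OPT_IO_SIZE); /* rounded up with no upper bound */

    printf("io size after rounding: %u bytes (%u pages)\n", size, size / PAGE_SZ);
    printf("bitmap pages allocated: %u\n", allocated_pages);
    printf("stray pages swept into the io: %u\n", size / PAGE_SZ - allocated_pages);
    return 0;
}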
When we send md writes over the network, e.g. with nvme over tcp, the network
subsystem calls sendpage_ok() on the first page only; the other 63 pages are
never checked, may well fail sendpage_ok(), and the transfer gets stuck,
causing a hang and data corruption.
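For reference, the sendpage_ok() helper in include/linux/net.h amounts to
roughly the following check (paraphrased, not the verbatim source): a page is
only safe for the zero-copy send path if it is not a slab page and its
refcount is at least 1.
/* Rough paraphrase of sendpage_ok() from include/linux/net.h. */
static inline bool sendpage_ok(struct page *page)
{
    return !PageSlab(page) && page_count(page) >= 1;
}
Any of the 63 trailing pages that happens to be a slab allocation or has a
zero refcount therefore trips the warning below.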
If you trigger the issue, you get the following warning and stack trace in dmesg:
WARNING: CPU: 0 PID: 83 at net/core/skbuff.c:6995 skb_splice_from_iter+0x139/0x370
CPU: 0 PID: 83 Comm: kworker/0:1H Not tainted 6.8.0-39-generic #39-Ubuntu
Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
RIP: 0010:skb_splice_from_iter+0x139/0x370
CR2: 000072dab83e5f84
Call Trace:
<TASK>
? show_regs+0x6d/0x80
? __warn+0x89/0x160
? skb_splice_from_iter+0x139/0x370
? report_bug+0x17e/0x1b0
? handle_bug+0x51/0xa0
? exc_invalid_op+0x18/0x80
? asm_exc_invalid_op+0x1b/0x20
? skb_splice_from_iter+0x139/0x370
tcp_sendmsg_locked+0x352/0xd70
? tcp_push+0x159/0x190
? tcp_sendmsg_locked+0x9c4/0xd70
tcp_sendmsg+0x2c/0x50
inet_sendmsg+0x42/0x80
sock_sendmsg+0x118/0x150
nvme_tcp_try_send_data+0x18b/0x4c0 [nvme_tcp]
? __tcp_cleanup_rbuf+0xc5/0xe0
nvme_tcp_try_send+0x23c/0x300 [nvme_tcp]
nvme_tcp_io_work+0x40/0xe0 [nvme_tcp]
process_one_work+0x16c/0x350
worker_thread+0x306/0x440
? _raw_spin_unlock_irqrestore+0x11/0x60
? __pfx_worker_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
nvme nvme1: failed to send request -5
nvme nvme1: I/O tag 125 (307d) type 4 opcode 0x0 (I/O Cmd) QID 1 timeout
nvme nvme1: starting error recovery
block nvme1n1: no usable path - requeuing I/O
nvme nvme1: Reconnecting in 10 seconds...
There is no workaround.
[Fix]
This was fixed by the following commit, which landed in 6.11-rc1:
commit ab99a87542f194f28e2364a42afbf9fb48b1c724
Author: Ofir Gal <ofir.gal at volumez.com>
Date: Fri Jun 7 10:27:44 2024 +0300
Subject: md/md-bitmap: fix writing non bitmap pages
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab99a87542f194f28e2364a42afbf9fb48b1c724
This is a clean cherry-pick to the Noble tree.
[Testcase]
This can be reproduced by running blktests md/001 [1], which the author of the
fix created to act as a regression test for this issue.
[1] https://github.com/osandov/blktests/commit/a24a7b462816fbad7dc6c175e53fcc764ad0a822
Deploy a fresh Noble VM that has a scratch NVMe disk.
$ sudo apt install build-essential fio
$ git clone https://github.com/osandov/blktests.git
$ cd blktests
$ make
$ echo "TEST_DEVS=(/dev/nvme0n1)" > config
$ sudo ./check md/001
The md/001 test will hang an affected system, and the above warning will be
visible in dmesg.
A test kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf390669-test
If you install the test kernel, the md/001 test will complete successfully, and
the issue will no longer appear.
[Where problems could occur]
We are changing how the md subsystem calculates the final io size, taking the
smaller of the rounded-up size and the remaining bitmap length. This makes sure
the io never spills into pages beyond the bitmap and corrupts data.
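As a minimal sketch of that clamping, again modelled in userspace with the
same example values as above (names are illustrative, this is not the upstream
diff):
/* Sketch only: the fixed calculation never extends past the bitmap. */
#include <stdio.h>

#define PAGE_SZ     4096u
#define OPT_IO_SIZE (256u * 1024u)

static unsigned int roundup_u(unsigned int x, unsigned int to)
{
    return ((x + to - 1) / to) * to;
}

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

int main(void)
{
    unsigned int file_pages = 1;  /* bitmap pages actually allocated */
    unsigned int pg_index = 0;    /* page being written */
    unsigned int bitmap_limit = (file_pages - pg_index) * PAGE_SZ;
    unsigned int size = PAGE_SZ;

    size = roundup_u(size, OPT_IO_SIZE); /* rounded up as before */
    size = min_u(size, bitmap_limit);    /* clamped to the remaining bitmap length */

    printf("final io size: %u bytes (%u page(s))\n", size, size / PAGE_SZ);
    return 0;
}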
If a regression were to occur, it would likely affect all md users, but would
be more obvious to md users over the network, like nvme over tcp.
There is no workaround. Users would have to downgrade their kernels if a
regression occurs.
[Other info]
I checked Jammy 5.15 and it works fine, so the issue must have been introduced
later on. The fix is not needed for Focal or Jammy.
Ofir Gal (1):
md/md-bitmap: fix writing non bitmap pages
drivers/md/md-bitmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--
2.45.2