[SRU][B][F][G][H][PATCH 0/6] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Matthew Ruffell matthew.ruffell at canonical.com
Thu May 6 04:04:31 UTC 2021


BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, which causes common operations that invoke
block discard, such as mkfs and fstrim, to take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
that support block discard, a mkfs.xfs operation on Raid10 takes between 8 and
11 minutes, while the same mkfs.xfs operation on Raid 0 takes 4 seconds.

The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
request into millions of 512k bio requests, even if the underlying device
supports larger requests.
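As a rough sketch of the scale involved (a back-of-the-envelope calculation,
not code from the patchset; the sizes are taken from the figures above):

```python
# With the md device capped at a 512 KiB discard_max_bytes, a ~1.9 TB
# discard is split into millions of bios; at the NVMe hardware limit it
# fits in a single bio.
DISCARD_SIZE = 1_900_000_000_000       # ~1.9 TB to discard
MD_LIMIT = 512 * 1024                  # md0 discard_max_bytes (512 KiB)
HW_LIMIT = 2_199_023_255_040           # nvme0n1 discard_max_hw_bytes

def bios_needed(size, limit):
    """Number of bio requests after splitting at the given limit."""
    return -(-size // limit)           # ceiling division

print(bios_needed(DISCARD_SIZE, MD_LIMIT))   # 3623963 (~3.6 million bios)
print(bios_needed(DISCARD_SIZE, HW_LIMIT))   # 1
```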

For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

The Raid10 md device, however, only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11
minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[Fix]

Xiao Ni has developed a patchset which resolves the block discard performance
problems. These commits have now landed in 5.13-rc1.

commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8
Author: Xiao Ni <xni at redhat.com>
Date: Thu Feb 4 15:50:43 2021 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8

commit c2968285925adb97b9aa4ede94c1f1ab61ce0925
Author: Xiao Ni <xni at redhat.com>
Date: Thu Feb 4 15:50:44 2021 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925

commit f2e7e269a7525317752d472bb48a549780e87d22
Author: Xiao Ni <xni at redhat.com>
Date: Thu Feb 4 15:50:45 2021 +0800
Subject: md/raid10: pull the code that wait for blocked dev into one function
Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22

commit d30588b2731fb01e1616cf16c3fe79a1443e29aa
Author: Xiao Ni <xni at redhat.com>
Date: Thu Feb 4 15:50:46 2021 +0800
Subject: md/raid10: improve raid10 discard request
Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa

commit 254c271da0712ea8914f187588e0f81f7678ee2f
Author: Xiao Ni <xni at redhat.com>
Date: Thu Feb 4 15:50:47 2021 +0800
Subject: md/raid10: improve discard request for far layout
Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f

There is one additional commit which is required, and which was merged after
"md/raid10: improve raid10 discard request". The following commit enables
Raid10 to use large discards, instead of splitting them into many bios, since
the technical hurdles have now been removed.

commit ca4a4e9a55beeb138bb06e3867f5e486da896d44
Author: Mike Snitzer <snitzer at redhat.com>
Date: Fri Apr 30 14:38:37 2021 -0400
Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44

The commits cherry-pick more or less cleanly to the 5.11, 5.8, 5.4 and 4.15
kernels, with the following minor backports:

1) submit_bio_noacct() needed to be renamed back to generic_make_request(),
since the older kernels predate the rename in:

commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
Author: Christoph Hellwig <hch at lst.de>
Date: Wed Jul 1 10:59:44 2020 +0200
Subject: block: rename generic_make_request to submit_bio_noacct
Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead

2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its
request_queue argument put back in place. It was recently removed in:

commit 1c02fca620f7273b597591065d366e2cca948d8f
Author: Christoph Hellwig <hch at lst.de>
Date: Thu Dec 3 17:21:38 2020 +0100
Subject: block: remove the request_queue argument to the block_bio_remap tracepoint
Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f

3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
'&' removed for one of their arguments for the 4.15 kernel, due to changes made
in:

commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
Author: Kent Overstreet <kent.overstreet at gmail.com>
Date: Sun May 20 18:25:52 2018 -0400
Subject: md: convert to bioset_init()/mempool_init()
Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8

4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for
raid0 and raid10" due to not having the following commit, which was merged in 
5.1-rc1:

commit 61697a6abd24acba941359c6268a94f4afe4a53d
Author: Mike Snitzer <snitzer at redhat.com>
Date: Fri Jan 18 14:19:26 2019 -0500
Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d

5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to 
bio_clone_blkcg_association() due to it changing in:

commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
Author: Dennis Zhou <dennis at kernel.org>
Date: Wed Dec 5 12:10:35 2018 -0500
Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1

[Testcase]

You will need a machine with at least 4x NVMe drives which support block
discard. I use an i3.8xlarge instance on AWS, since it meets these requirements.

$ lsblk
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:2 0 1.7T 0 disk
nvme1n1 259:0 0 1.7T 0 disk
nvme2n1 259:1 0 1.7T 0 disk
nvme3n1 259:3 0 1.7T 0 disk

Create a Raid10 array:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Format the array with XFS:

$ time sudo mkfs.xfs /dev/md0
real 11m14.734s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk

Optionally, do an fstrim:

$ time sudo fstrim /mnt/disk

real 11m37.643s

There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:

https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test

If you install a test kernel, you can see that performance improves dramatically:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

$ time sudo mkfs.xfs /dev/md0
real 0m4.226s
user 0m0.020s
sys 0m0.148s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real 0m1.991s
user 0m0.020s
sys 0m0.000s

The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
from 11 minutes to 2 seconds.

Performance Matrix (AWS i3.8xlarge):

Kernel    | mkfs.xfs  | fstrim
---------------------------------
4.15      | 7m23.449s | 7m20.678s
5.4       | 8m23.219s | 8m23.927s
5.8       | 2m54.990s | 8m22.010s
4.15-test | 0m4.286s  | 0m1.657s
5.4-test  | 0m6.075s  | 0m3.150s
5.8-test  | 0m2.753s  | 0m2.999s

The test kernel also changes the discard_max_bytes to the underlying hardware 
limit:

$ cat /sys/block/md0/queue/discard_max_bytes
2199023255040

[Where problems can occur]

A problem has occurred once before, with the previous revision of this patchset.
This is documented in bug 1907262, and caused a worst case scenario of data
loss for some users, in this particular case, on the second and onward disks.
This was due to two faults: the first was an incorrectly calculated start
offset for block discard on the second and subsequent disks; the second was an
incorrect stripe size for far layouts.

The kernel team was forced to revert the patches in an emergency, the faulty
kernel was removed from the archive, and community users were urged to avoid it.

These bugs and a few other minor issues have now been corrected, and we have 
been testing the new patches since mid February. The patches have been tested 
against the testcase in bug 1907262 and do not cause the disks to become 
corrupted.

The regression potential is still the same for this patchset though. If a 
regression were to occur, it could lead to data loss on Raid10 arrays backed by 
NVMe or SSD disks that support block discard.

If a regression happens, users need to disable the fstrim systemd service as 
soon as possible, plan an emergency maintenance window, and downgrade the kernel
to a previous release, or upgrade to a corrected kernel.

Mike Snitzer (1):
  dm raid: remove unnecessary discard limits for raid0 and raid10

Xiao Ni (5):
  md: add md_submit_discard_bio() for submitting discard bio
  md/raid10: extend r10bio devs to raid disks
  md/raid10: pull the code that wait for blocked dev into one function
  md/raid10: improve raid10 discard request
  md/raid10: improve discard request for far layout

 drivers/md/dm-raid.c |   9 -
 drivers/md/md.c      |  20 ++
 drivers/md/md.h      |   2 +
 drivers/md/raid0.c   |  14 +-
 drivers/md/raid10.c  | 434 +++++++++++++++++++++++++++++++++++++------
 drivers/md/raid10.h  |   1 +
 6 files changed, 402 insertions(+), 78 deletions(-)

-- 
2.30.2
