[SRU][B][F][G][H][PATCH 0/6] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Tim Gardner tim.gardner at canonical.com
Fri May 7 13:28:28 UTC 2021


This is way more code change than I can wrap my head around. Aside from 
the test cases that exercise block discard, have you run any other file 
system tests on top of a raid 10 configuration? These changes are not 
isolated to just raid 10, so perhaps some testing on other raid 
configurations would be appropriate. I know you've already put a lot of 
work into this, but it's a huge change. I'd also be inclined to let the 
upstream changes bake awhile, then monitor for subsequent fixes.

This isn't a NACK per se, but I'm in no hurry to ACK just yet.

rtg

On 5/5/21 10:04 PM, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
> 
> [Impact]
> 
> Block discard is very slow on Raid10, which causes common operations that
> invoke block discard, such as mkfs and fstrim, to take a very long time.
> 
> For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
> that support block discard, a mkfs.xfs operation on Raid10 takes between 8 and
> 11 minutes, while the same mkfs.xfs operation on Raid 0 takes 4 seconds.
> 
> The bigger the devices, the longer it takes.
> 
> The cause is that Raid10 currently uses its 512k chunk size as the
> discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
> request into millions of 512k bio requests, even if the underlying device
> supports larger requests.
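> 
> As a rough back-of-the-envelope check (illustrative arithmetic, not from the
> original report), discarding 1.9TB in 512k chunks really does mean millions
> of bios:
> 
> $ echo $((1900000000000 / 524288))
> 3623962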
> 
> For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
> 
> $ cat /sys/block/nvme0n1/queue/discard_max_bytes
> 2199023255040
> $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
> 2199023255040
> 
> The Raid10 md device, by contrast, only supports 512k:
> 
> $ cat /sys/block/md0/queue/discard_max_bytes
> 524288
> $ cat /sys/block/md0/queue/discard_max_hw_bytes
> 524288
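> 
> The per-device discard limits can also be compared at a glance with lsblk's
> discard view (lsblk -D prints the DISC-GRAN and DISC-MAX columns for every
> block device):
> 
> $ lsblk --discard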
> 
> If we perform a mkfs.xfs operation on the /dev/md0 array, it takes over 11
> minutes, and if we examine the stack, we see it is stuck in blkdev_issue_discard():
> 
> $ sudo cat /proc/1626/stack
> [<0>] wait_barrier+0x14c/0x230 [raid10]
> [<0>] regular_request_wait+0x39/0x150 [raid10]
> [<0>] raid10_write_request+0x11e/0x850 [raid10]
> [<0>] raid10_make_request+0xd7/0x150 [raid10]
> [<0>] md_handle_request+0x123/0x1a0
> [<0>] md_submit_bio+0xda/0x120
> [<0>] __submit_bio_noacct+0xde/0x320
> [<0>] submit_bio_noacct+0x4d/0x90
> [<0>] submit_bio+0x4f/0x1b0
> [<0>] __blkdev_issue_discard+0x154/0x290
> [<0>] blkdev_issue_discard+0x5d/0xc0
> [<0>] blk_ioctl_discard+0xc4/0x110
> [<0>] blkdev_common_ioctl+0x56c/0x840
> [<0>] blkdev_ioctl+0xeb/0x270
> [<0>] block_ioctl+0x3d/0x50
> [<0>] __x64_sys_ioctl+0x91/0xc0
> [<0>] do_syscall_64+0x38/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> [Fix]
> 
> Xiao Ni has developed a patchset which resolves the block discard performance
> problems. These commits have now landed in 5.13-rc1.
> 
> commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8
> Author: Xiao Ni <xni at redhat.com>
> Date: Thu Feb 4 15:50:43 2021 +0800
> Subject: md: add md_submit_discard_bio() for submitting discard bio
> Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8
> 
> commit c2968285925adb97b9aa4ede94c1f1ab61ce0925
> Author: Xiao Ni <xni at redhat.com>
> Date: Thu Feb 4 15:50:44 2021 +0800
> Subject: md/raid10: extend r10bio devs to raid disks
> Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925
> 
> commit f2e7e269a7525317752d472bb48a549780e87d22
> Author: Xiao Ni <xni at redhat.com>
> Date: Thu Feb 4 15:50:45 2021 +0800
> Subject: md/raid10: pull the code that wait for blocked dev into one function
> Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22
> 
> commit d30588b2731fb01e1616cf16c3fe79a1443e29aa
> Author: Xiao Ni <xni at redhat.com>
> Date: Thu Feb 4 15:50:46 2021 +0800
> Subject: md/raid10: improve raid10 discard request
> Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa
> 
> commit 254c271da0712ea8914f187588e0f81f7678ee2f
> Author: Xiao Ni <xni at redhat.com>
> Date: Thu Feb 4 15:50:47 2021 +0800
> Subject: md/raid10: improve discard request for far layout
> Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f
> 
> One additional commit is also required, which was merged after
> "md/raid10: improve raid10 discard request". It enables Raid10 to use large
> discards, instead of splitting them into many bios, since the technical
> hurdles have now been removed.
> 
> commit ca4a4e9a55beeb138bb06e3867f5e486da896d44
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Fri Apr 30 14:38:37 2021 -0400
> Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
> Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44
> 
> The commits cherry-pick more or less cleanly to the 5.11, 5.8, 5.4 and 4.15
> kernels, with the following minor backports:
> 
> 1) submit_bio_noacct() needed to be renamed back to generic_make_request(),
> since the function was only renamed upstream in:
> 
> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> Author: Christoph Hellwig <hch at lst.de>
> Date: Wed Jul 1 10:59:44 2020 +0200
> Subject: block: rename generic_make_request to submit_bio_noacct
> Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
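> 
> In the backported hunks this is just the reverse rename at each call site,
> along these lines (illustrative, not the exact hunks):
> 
> -	submit_bio_noacct(discard_bio);
> +	generic_make_request(discard_bio);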
> 
> 2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its
> request_queue argument put back in place. It was recently removed in:
> 
> commit 1c02fca620f7273b597591065d366e2cca948d8f
> Author: Christoph Hellwig <hch at lst.de>
> Date: Thu Dec 3 17:21:38 2020 +0100
> Subject: block: remove the request_queue argument to the block_bio_remap tracepoint
> Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f
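> 
> The backported call sites gain the queue back as the first argument, roughly
> like the following (illustrative, based on the pre-rework tracepoint
> signature):
> 
> -	trace_block_bio_remap(discard_bio, disk_devt(mddev->gendisk),
> +	trace_block_bio_remap(bdev_get_queue(rdev->bdev),
> +			      discard_bio, disk_devt(mddev->gendisk),
>  			      bio->bi_iter.bi_sector);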
> 
> 3) bio_split(), mempool_alloc() and bio_clone_fast() all needed the "address
> of" operator '&' removed from one of their arguments for the 4.15 kernel,
> since 4.15 predates the bioset_init()/mempool_init() conversion made in:
> 
> commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> Author: Kent Overstreet <kent.overstreet at gmail.com>
> Date: Sun May 20 18:25:52 2018 -0400
> Subject: md: convert to bioset_init()/mempool_init()
> Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
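> 
> In 4.15 the raid10 biosets and mempools are still pointers rather than
> embedded structures, so the hunks drop the '&', roughly (illustrative):
> 
> -	split = bio_split(bio, split_size, GFP_NOIO, &conf->bio_split);
> +	split = bio_split(bio, split_size, GFP_NOIO, conf->bio_split);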
> 
> 4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for
> raid0 and raid10" due to not having the following commit, which was merged in
> 5.1-rc1:
> 
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Fri Jan 18 14:19:26 2019 -0500
> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
> 
> 5) The 4.15 kernel needed bio_clone_blkg_association() renamed to
> bio_clone_blkcg_association(), since the blkcg API was reworked in:
> 
> commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> Author: Dennis Zhou <dennis at kernel.org>
> Date: Wed Dec 5 12:10:35 2018 -0500
> Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
> Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
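> 
> Again a one-line rename at the call site, roughly (illustrative):
> 
> -	bio_clone_blkg_association(discard_bio, bio);
> +	bio_clone_blkcg_association(discard_bio, bio);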
> 
> [Testcase]
> 
> You will need a machine with at least 4x NVMe drives which support block
> discard. I use an i3.8xlarge instance on AWS, since it has all of these things.
> 
> $ lsblk
> xvda 202:0 0 8G 0 disk
> └─xvda1 202:1 0 8G 0 part /
> nvme0n1 259:2 0 1.7T 0 disk
> nvme1n1 259:0 0 1.7T 0 disk
> nvme2n1 259:1 0 1.7T 0 disk
> nvme3n1 259:3 0 1.7T 0 disk
> 
> Create a Raid10 array:
> 
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
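> 
> Optionally, confirm the array has assembled before formatting:
> 
> $ cat /proc/mdstat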
> 
> Format the array with XFS:
> 
> $ time sudo mkfs.xfs /dev/md0
> real 11m14.734s
> 
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> 
> Optionally, do an fstrim:
> 
> $ time sudo fstrim /mnt/disk
> 
> real 11m37.643s
> 
> There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test
> 
> If you install a test kernel, you can see that performance improves dramatically:
> 
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> $ time sudo mkfs.xfs /dev/md0
> real 0m4.226s
> user 0m0.020s
> sys 0m0.148s
> 
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> $ time sudo fstrim /mnt/disk
> 
> real 0m1.991s
> user 0m0.020s
> sys 0m0.000s
> 
> The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
> from 11 minutes to 2 seconds.
> 
> Performance Matrix (AWS i3.8xlarge):
> 
> Kernel    | mkfs.xfs  | fstrim
> ----------+-----------+----------
> 4.15      | 7m23.449s | 7m20.678s
> 5.4       | 8m23.219s | 8m23.927s
> 5.8       | 2m54.990s | 8m22.010s
> 4.15-test | 0m4.286s  | 0m1.657s
> 5.4-test  | 0m6.075s  | 0m3.150s
> 5.8-test  | 0m2.753s  | 0m2.999s
> 
> The test kernel also raises discard_max_bytes to the underlying hardware
> limit:
> 
> $ cat /sys/block/md0/queue/discard_max_bytes
> 2199023255040
> 
> [Where problems can occur]
> 
> A problem has occurred once before, with the previous revision of this patchset.
> This was documented in bug 1907262, and in the worst case caused data loss for
> some users, in this particular case on the second and subsequent disks. It was
> due to two faults: the first was an incorrect calculation of the start offset
> for block discard on the second and subsequent disks; the second was an
> incorrect stripe size for far layouts.
> 
> The kernel team was forced to revert the patches in an emergency, the faulty
> kernel was removed from the archive, and community users were urged to avoid it.
> 
> These bugs and a few other minor issues have now been corrected, and we have
> been testing the new patches since mid-February. The patches have been tested
> against the testcase in bug 1907262 and do not cause the disks to become
> corrupted.
> 
> The regression potential is still the same for this patchset though. If a
> regression were to occur, it could lead to data loss on Raid10 arrays backed by
> NVMe or SSD disks that support block discard.
> 
> If a regression happens, users need to disable the fstrim systemd timer as
> soon as possible, plan an emergency maintenance window, and downgrade the
> kernel to a previous release, or upgrade to a corrected kernel.
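> 
> For example, the immediate mitigation on Ubuntu would be along the lines of:
> 
> $ sudo systemctl disable --now fstrim.timer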
> 
> Mike Snitzer (1):
>    dm raid: remove unnecessary discard limits for raid0 and raid10
> 
> Xiao Ni (5):
>    md: add md_submit_discard_bio() for submitting discard bio
>    md/raid10: extend r10bio devs to raid disks
>    md/raid10: pull the code that wait for blocked dev into one function
>    md/raid10: improve raid10 discard request
>    md/raid10: improve discard request for far layout
> 
>   drivers/md/dm-raid.c |   9 -
>   drivers/md/md.c      |  20 ++
>   drivers/md/md.h      |   2 +
>   drivers/md/raid0.c   |  14 +-
>   drivers/md/raid10.c  | 434 +++++++++++++++++++++++++++++++++++++------
>   drivers/md/raid10.h  |   1 +
>   6 files changed, 402 insertions(+), 78 deletions(-)
> 

-- 
-----------
Tim Gardner
Canonical, Inc


