[SRU][B][F][G][PATCH 0/7] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Stefan Bader stefan.bader at canonical.com
Thu Oct 29 08:46:08 UTC 2020


On 29.10.20 04:07, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
> 
> [Impact]
> 
> Block discard is very slow on Raid10, so common operations that invoke
> block discard, such as mkfs and fstrim, take a very long time.
> 
> For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
> which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and
> 11 minutes, whereas the same mkfs.xfs operation on Raid 0 takes 4 seconds.
> 
> The bigger the devices, the longer it takes.
> 
> The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
> discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
> request into millions of 512k bio requests, even if the underlying device
> supports larger requests.
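
To put the splitting described above into numbers, here is a rough Python sketch of the arithmetic. The sizes are assumptions taken from the figures quoted in this message (a 1.9TB discard, the 512k md limit, and the NVMe limit shown further down); the helper name is only illustrative and is not a kernel function.

```python
# Rough estimate of how many discard bios are needed when
# discard_max_bytes caps the size of each request.
# All sizes are assumptions taken from the figures quoted in this thread.

def discard_bio_count(total_bytes: int, discard_max_bytes: int) -> int:
    """Number of bios needed to discard total_bytes, rounding up."""
    return -(-total_bytes // discard_max_bytes)  # ceiling division

DEVICE_BYTES = 1_900_000_000_000   # 1.9TB (decimal), as quoted above
RAID10_LIMIT = 524288              # md0 discard_max_bytes before the fix
NVME_LIMIT = 2199023255040         # nvme0n1 discard_max_bytes

if __name__ == "__main__":
    print("bios at 512k limit:", discard_bio_count(DEVICE_BYTES, RAID10_LIMIT))
    print("bios at NVMe limit:", discard_bio_count(DEVICE_BYTES, NVME_LIMIT))
```

With the 512k limit this comes to roughly 3.6 million bios for 1.9TB, while the same range fits in a single request at the NVMe limit.
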
> 
> For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
> 
> $ cat /sys/block/nvme0n1/queue/discard_max_bytes
> 2199023255040
> $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
> 2199023255040 
> 
> Where the Raid10 md device only supports 512k:
> 
> $ cat /sys/block/md0/queue/discard_max_bytes
> 524288
> $ cat /sys/block/md0/queue/discard_max_hw_bytes
> 524288 
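
The comparison above can also be scripted. The following is a minimal Python sketch that reads the same sysfs files as the cat commands; the function names and the sysfs_root parameter are hypothetical helpers added for illustration, and the device names md0 and nvme0n1 are simply the ones used in this message.

```python
from pathlib import Path
from typing import Optional

def discard_max_bytes(device: str, sysfs_root: str = "/sys/block") -> Optional[int]:
    """Read <sysfs_root>/<device>/queue/discard_max_bytes, or None if unreadable."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

def is_discard_limited(md_device: str, member: str,
                       sysfs_root: str = "/sys/block") -> bool:
    """True when the md device advertises a smaller discard limit than a member."""
    md_limit = discard_max_bytes(md_device, sysfs_root)
    member_limit = discard_max_bytes(member, sysfs_root)
    if md_limit is None or member_limit is None:
        return False  # cannot tell without both values
    return md_limit < member_limit

if __name__ == "__main__":
    # Device names are the ones from this thread; adjust for your system.
    for dev in ("md0", "nvme0n1"):
        print(dev, discard_max_bytes(dev))
    print("md0 limited:", is_discard_limited("md0", "nvme0n1"))
```

On an affected kernel with the values shown above, is_discard_limited("md0", "nvme0n1") would be expected to return True.
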
> 
> If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes,
> and if we examine the stack, it is stuck in blkdev_issue_discard():
> 
> $ sudo cat /proc/1626/stack
> [<0>] wait_barrier+0x14c/0x230 [raid10]
> [<0>] regular_request_wait+0x39/0x150 [raid10]
> [<0>] raid10_write_request+0x11e/0x850 [raid10]
> [<0>] raid10_make_request+0xd7/0x150 [raid10]
> [<0>] md_handle_request+0x123/0x1a0
> [<0>] md_submit_bio+0xda/0x120
> [<0>] __submit_bio_noacct+0xde/0x320
> [<0>] submit_bio_noacct+0x4d/0x90
> [<0>] submit_bio+0x4f/0x1b0
> [<0>] __blkdev_issue_discard+0x154/0x290
> [<0>] blkdev_issue_discard+0x5d/0xc0
> [<0>] blk_ioctl_discard+0xc4/0x110
> [<0>] blkdev_common_ioctl+0x56c/0x840
> [<0>] blkdev_ioctl+0xeb/0x270
> [<0>] block_ioctl+0x3d/0x50
> [<0>] __x64_sys_ioctl+0x91/0xc0
> [<0>] do_syscall_64+0x38/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 
> 
> [Fix]
> 
> Xiao Ni has developed a patchset which resolves the block discard performance 
> problems. These commits have now landed in 5.10-rc1.
> 
> commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
> Author: Xiao Ni <xni at redhat.com>
> Date: Tue Aug 25 13:42:59 2020 +0800
> Subject: md: add md_submit_discard_bio() for submitting discard bio
> Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
> 
> commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
> Author: Xiao Ni <xni at redhat.com>
> Date: Tue Aug 25 13:43:00 2020 +0800
> Subject: md/raid10: extend r10bio devs to raid disks
> Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
> 
> commit f046f5d0d79cdb968f219ce249e497fd1accf484
> Author: Xiao Ni <xni at redhat.com>
> Date: Tue Aug 25 13:43:01 2020 +0800
> Subject: md/raid10: pull codes that wait for blocked dev into one function
> Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484
> 
> commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
> Author: Xiao Ni <xni at redhat.com>
> Date: Wed Sep 2 20:00:22 2020 +0800
> Subject: md/raid10: improve raid10 discard request
> Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9
> 
> commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
> Author: Xiao Ni <xni at redhat.com>
> Date: Wed Sep 2 20:00:23 2020 +0800
> Subject: md/raid10: improve discard request for far layout
> Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359
> 
> There are also some additional commits which are required, and which were merged
> after "md/raid10: improve raid10 discard request". The following commits
> enable Raid10 to use large discards, instead of splitting into many bios,
> since the technical hurdles have now been removed.

The below two patches are marked up as only needed for F and G. What about
Bionic? If the changes they refer to were in 4.12, then those would have to go
to Bionic as well.

Besides that, I am not sure how exactly that might be better phrased, but
personally I stumbled over "remove 'address of' pointer for...". Maybe "do
not use a pointer for one of the arguments to ..." but not sure.

-Stefan

> 
> commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Thu Sep 24 13:14:52 2020 -0400
> Subject: dm raid: fix discard limits for raid1 and raid10
> Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512
> 
> commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Thu Sep 24 16:40:12 2020 -0400
> Subject: dm raid: remove unnecessary discard limits for raid10
> Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28
> 
> All the commits mentioned follow a strategy similar to the one implemented for
> Raid0 in the commit below, which was merged in 4.12-rc2 and fixed block
> discard performance issues in Raid0:
> 
> commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
> Author: Shaohua Li <shli at fb.com>
> Date: Sun May 7 17:36:24 2017 -0700
> Subject: md/md0: optimize raid0 discard handling
> Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 
> 
> The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the
> following minor fixups:
> 
> 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it 
> was recently changed in:
> 
> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> Author: Christoph Hellwig <hch at lst.de>
> Date:   Wed Jul 1 10:59:44 2020 +0200
> Subject: block: rename generic_make_request to submit_bio_noacct
> Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> 
> 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
> '&' removed for one of their arguments for the 4.15 kernel, due to changes made
> in:
> 
> commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> Author: Kent Overstreet <kent.overstreet at gmail.com>
> Date:   Sun May 20 18:25:52 2018 -0400
> Subject: md: convert to bioset_init()/mempool_init()
> Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> 
> 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10"
> and "dm raid: remove unnecessary discard limits for raid10" due to not having
> the following commit, which was merged in 5.1-rc1:
> 
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer at redhat.com>
> Date:   Fri Jan 18 14:19:26 2019 -0500
> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
> 
> 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to 
> bio_clone_blkcg_association() due to it changing in:
> 
> commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> Author: Dennis Zhou <dennis at kernel.org>
> Date:   Wed Dec 5 12:10:35 2018 -0500
> Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
> Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> 
> [Testcase]
> 
> You will need a machine with at least 4x NVMe drives which support block discard.
> I use an i3.8xlarge instance on AWS, since it has all of these things.
> 
> $ lsblk 
> xvda    202:0    0    8G  0 disk 
> └─xvda1 202:1    0    8G  0 part /
> nvme0n1 259:2    0  1.7T  0 disk 
> nvme1n1 259:0    0  1.7T  0 disk 
> nvme2n1 259:1    0  1.7T  0 disk 
> nvme3n1 259:3    0  1.7T  0 disk
> 
> Create a Raid10 array:
> 
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> Format the array with XFS:
> 
> $ time sudo mkfs.xfs /dev/md0
> real 11m14.734s 
> 
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> 
> Optionally, run fstrim:
> 
> $ time sudo fstrim /mnt/disk
> 
> real	11m37.643s 
> 
> There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test
> 
> If you install a test kernel, we can see that performance dramatically improves:
> 
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> $ time sudo mkfs.xfs /dev/md0
> real	0m4.226s
> user	0m0.020s
> sys	0m0.148s
> 
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> $ time sudo fstrim /mnt/disk
> 
> real	0m1.991s
> user	0m0.020s
> sys	0m0.000s
> 
> The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
> from 11 minutes to 2 seconds.
> 
> Performance Matrix (AWS i3.8xlarge):
> 
> Kernel    | mkfs.xfs  | fstrim
> ---------------------------------
> 4.15      | 7m23.449s | 7m20.678s
> 5.4       | 8m23.219s | 8m23.927s
> 5.8       | 2m54.990s | 8m22.010s
> 4.15-test | 0m4.286s  | 0m1.657s
> 5.4-test  | 0m6.075s  | 0m3.150s
> 5.8-test  | 0m2.753s  | 0m2.999s
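
For comparing the rows of the matrix above, a small Python sketch can convert the time(1) style values to seconds and compute the speedup per kernel. The timings are copied from the table; the helper names are illustrative only.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a time(1) style value such as '7m23.449s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

def speedup(before: str, after: str) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return to_seconds(before) / to_seconds(after)

# (mkfs.xfs before, mkfs.xfs after, fstrim before, fstrim after),
# copied from the performance matrix above.
RESULTS = {
    "4.15": ("7m23.449s", "0m4.286s", "7m20.678s", "0m1.657s"),
    "5.4":  ("8m23.219s", "0m6.075s", "8m23.927s", "0m3.150s"),
    "5.8":  ("2m54.990s", "0m2.753s", "8m22.010s", "0m2.999s"),
}

if __name__ == "__main__":
    for kernel, (mk_old, mk_new, tr_old, tr_new) in RESULTS.items():
        print(f"{kernel}: mkfs.xfs {speedup(mk_old, mk_new):.0f}x, "
              f"fstrim {speedup(tr_old, tr_new):.0f}x")
```
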
> 
> The test kernel also changes the discard_max_bytes to the underlying hardware
> limit:
> 
> $ cat /sys/block/md0/queue/discard_max_bytes 
> 2199023255040
> 
> [Regression Potential]
> 
> If a regression were to occur, then it would affect operations which would
> trigger block discard operations, such as mkfs and fstrim, on Raid10 only.
> 
> Other Raid levels would not be affected, although, I should note there will be
> a small risk of regression to Raid0, due to one of its functions being
> re-factored and split out, for use in both Raid0 and Raid10.
> 
> The changes only affect block discard, so only Raid10 arrays backed by SSD or
> NVMe devices which support block discard will be affected. Traditional hard
> disks, or SSD devices which do not support block discard would not be affected.
> 
> If a regression were to occur, users could work around the issue by running
> "mkfs.xfs -K <device>" which would skip block discard entirely.
> 
> Mike Snitzer (2):
>   dm raid: fix discard limits for raid1 and raid10
>   dm raid: remove unnecessary discard limits for raid10
> 
> Xiao Ni (5):
>   md: add md_submit_discard_bio() for submitting discard bio
>   md/raid10: extend r10bio devs to raid disks
>   md/raid10: pull codes that wait for blocked dev into one function
>   md/raid10: improve raid10 discard request
>   md/raid10: improve discard request for far layout
> 
>  drivers/md/dm-raid.c |   9 -
>  drivers/md/md.c      |  20 ++
>  drivers/md/md.h      |   2 +
>  drivers/md/raid0.c   |  14 +-
>  drivers/md/raid10.c  | 423 +++++++++++++++++++++++++++++++++++++------
>  drivers/md/raid10.h  |   1 +
>  6 files changed, 391 insertions(+), 78 deletions(-)
> 

