APPLIED[F]: [SRU][B][F][G][PATCH 0/7] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Ian May
ian.may at canonical.com
Fri Nov 6 06:07:01 UTC 2020
Applied to Focal/master-next
Thanks,
Ian
On 2020-10-29 16:07:27 , Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
>
> [Impact]
>
> Block discard is very slow on Raid10, which causes common use cases which invoke
> block discard, such as mkfs and fstrim operations, to take a very long time.
>
> For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
> which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and
> 11 minutes, whereas the same mkfs.xfs operation on Raid 0 takes 4 seconds.
>
> The bigger the devices, the longer it takes.
>
> The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
> discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
> request into millions of 512k bio requests, even if the underlying device
> supports larger requests.
>
> For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
>
> $ cat /sys/block/nvme0n1/queue/discard_max_bytes
> 2199023255040
> $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
> 2199023255040
>
> Where the Raid10 md device only supports 512k:
>
> $ cat /sys/block/md0/queue/discard_max_bytes
> 524288
> $ cat /sys/block/md0/queue/discard_max_hw_bytes
> 524288
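
To put the two limits above in perspective, here is a rough sketch in Python (not part of the patchset) that estimates how many discard bios are needed to cover a given range. It assumes each bio covers at most discard_max_bytes and ignores alignment and discard granularity. The helper name discard_bio_count is made up for illustration, and 1.9TB is taken to mean 1.9 * 10^12 bytes.

```python
def discard_bio_count(total_bytes, discard_max_bytes):
    """Estimate the number of bios needed to discard total_bytes.

    Assumption: every bio covers at most discard_max_bytes, and
    alignment/granularity constraints are ignored.
    """
    if total_bytes < 0 or discard_max_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return -(-total_bytes // discard_max_bytes)


DEVICE_BYTES = 1_900_000_000_000   # one 1.9TB NVMe device (decimal TB assumed)
MD_LIMIT = 524288                  # discard_max_bytes reported by md0
NVME_LIMIT = 2199023255040         # discard_max_bytes reported by nvme0n1

print("bios at the md0 limit: ", discard_bio_count(DEVICE_BYTES, MD_LIMIT))
print("bios at the NVMe limit:", discard_bio_count(DEVICE_BYTES, NVME_LIMIT))
```

With the 512k limit the count runs into the millions, while the hardware limit would cover the same range with a single request.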
>
> If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes,
> and if we examine the stack, it is stuck in blkdev_issue_discard():
>
> $ sudo cat /proc/1626/stack
> [<0>] wait_barrier+0x14c/0x230 [raid10]
> [<0>] regular_request_wait+0x39/0x150 [raid10]
> [<0>] raid10_write_request+0x11e/0x850 [raid10]
> [<0>] raid10_make_request+0xd7/0x150 [raid10]
> [<0>] md_handle_request+0x123/0x1a0
> [<0>] md_submit_bio+0xda/0x120
> [<0>] __submit_bio_noacct+0xde/0x320
> [<0>] submit_bio_noacct+0x4d/0x90
> [<0>] submit_bio+0x4f/0x1b0
> [<0>] __blkdev_issue_discard+0x154/0x290
> [<0>] blkdev_issue_discard+0x5d/0xc0
> [<0>] blk_ioctl_discard+0xc4/0x110
> [<0>] blkdev_common_ioctl+0x56c/0x840
> [<0>] blkdev_ioctl+0xeb/0x270
> [<0>] block_ioctl+0x3d/0x50
> [<0>] __x64_sys_ioctl+0x91/0xc0
> [<0>] do_syscall_64+0x38/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [Fix]
>
> Xiao Ni has developed a patchset which resolves the block discard performance
> problems. These commits have now landed in 5.10-rc1.
>
> commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
> Author: Xiao Ni <xni at redhat.com>
> Date: Tue Aug 25 13:42:59 2020 +0800
> Subject: md: add md_submit_discard_bio() for submitting discard bio
> Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
>
> commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
> Author: Xiao Ni <xni at redhat.com>
> Date: Tue Aug 25 13:43:00 2020 +0800
> Subject: md/raid10: extend r10bio devs to raid disks
> Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
>
> commit f046f5d0d79cdb968f219ce249e497fd1accf484
> Author: Xiao Ni <xni at redhat.com>
> Date: Tue Aug 25 13:43:01 2020 +0800
> Subject: md/raid10: pull codes that wait for blocked dev into one function
> Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484
>
> commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
> Author: Xiao Ni <xni at redhat.com>
> Date: Wed Sep 2 20:00:22 2020 +0800
> Subject: md/raid10: improve raid10 discard request
> Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9
>
> commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
> Author: Xiao Ni <xni at redhat.com>
> Date: Wed Sep 2 20:00:23 2020 +0800
> Subject: md/raid10: improve discard request for far layout
> Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359
>
> There are also some additional commits which are required, and which were merged
> after "md/raid10: improve raid10 discard request". The following commits
> enable Raid10 to use large discards, instead of splitting into many bios,
> since the technical hurdles have now been removed.
>
> commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Thu Sep 24 13:14:52 2020 -0400
> Subject: dm raid: fix discard limits for raid1 and raid10
> Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512
>
> commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Thu Sep 24 16:40:12 2020 -0400
> Subject: dm raid: remove unnecessary discard limits for raid10
> Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28
>
> All the commits mentioned follow a strategy similar to the one implemented for
> Raid0 in the commit below, which was merged in 4.12-rc2 and fixed block
> discard performance issues in Raid0:
>
> commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
> Author: Shaohua Li <shli at fb.com>
> Date: Sun May 7 17:36:24 2017 -0700
> Subject: md/md0: optimize raid0 discard handling
> Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0
>
> The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the
> following minor fixups:
>
> 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it
> was recently changed in:
>
> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> Author: Christoph Hellwig <hch at lst.de>
> Date: Wed Jul 1 10:59:44 2020 +0200
> Subject: block: rename generic_make_request to submit_bio_noacct
> Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
>
> 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
> '&' removed for one of their arguments for the 4.15 kernel, due to changes made
> in:
>
> commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> Author: Kent Overstreet <kent.overstreet at gmail.com>
> Date: Sun May 20 18:25:52 2018 -0400
> Subject: md: convert to bioset_init()/mempool_init()
> Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
>
> 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10"
> and "dm raid: remove unnecessary discard limits for raid10" due to not having
> the following commit, which was merged in 5.1-rc1:
>
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer at redhat.com>
> Date: Fri Jan 18 14:19:26 2019 -0500
> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
>
> 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to
> bio_clone_blkcg_association() due to it changing in:
>
> commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> Author: Dennis Zhou <dennis at kernel.org>
> Date: Wed Dec 5 12:10:35 2018 -0500
> Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
> Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
>
> [Testcase]
>
> You will need a machine with at least 4x NVMe drives which support block discard.
> I use an i3.8xlarge instance on AWS, since it has all of these things.
>
> $ lsblk
> xvda 202:0 0 8G 0 disk
> └─xvda1 202:1 0 8G 0 part /
> nvme0n1 259:2 0 1.7T 0 disk
> nvme1n1 259:0 0 1.7T 0 disk
> nvme2n1 259:1 0 1.7T 0 disk
> nvme3n1 259:3 0 1.7T 0 disk
>
> Create a Raid10 array:
>
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Format the array with XFS:
>
> $ time sudo mkfs.xfs /dev/md0
> real 11m14.734s
>
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
>
> Optionally, run fstrim:
>
> $ time sudo fstrim /mnt/disk
>
> real 11m37.643s
>
> There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test
>
> If you install a test kernel, we can see that performance dramatically improves:
>
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> $ time sudo mkfs.xfs /dev/md0
> real 0m4.226s
> user 0m0.020s
> sys 0m0.148s
>
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> $ time sudo fstrim /mnt/disk
>
> real 0m1.991s
> user 0m0.020s
> sys 0m0.000s
>
> The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
> from 11 minutes to 2 seconds.
>
> Performance Matrix (AWS i3.8xlarge):
>
> Kernel | mkfs.xfs | fstrim
> ---------------------------------
> 4.15 | 7m23.449s | 7m20.678s
> 5.4 | 8m23.219s | 8m23.927s
> 5.8 | 2m54.990s | 8m22.010s
> 4.15-test | 0m4.286s | 0m1.657s
> 5.4-test | 0m6.075s | 0m3.150s
> 5.8-test | 0m2.753s | 0m2.999s
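
As a rough illustration of what the matrix above implies, the following Python sketch converts the `time`-style durations to seconds and computes the speedup per kernel. The timings are copied from the matrix; the helper names parse_time and speedup are hypothetical and not part of any tool mentioned in this thread.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '7m23.449s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedup(before, after):
    """Return how many times faster `after` is compared to `before`."""
    return parse_time(before) / parse_time(after)


# (stock kernel, test kernel) timings copied from the matrix above.
TIMINGS = {
    "4.15": {"mkfs.xfs": ("7m23.449s", "0m4.286s"),
             "fstrim": ("7m20.678s", "0m1.657s")},
    "5.4": {"mkfs.xfs": ("8m23.219s", "0m6.075s"),
            "fstrim": ("8m23.927s", "0m3.150s")},
    "5.8": {"mkfs.xfs": ("2m54.990s", "0m2.753s"),
            "fstrim": ("8m22.010s", "0m2.999s")},
}

for kernel, operations in TIMINGS.items():
    for operation, (before, after) in operations.items():
        print(f"{kernel:>5} {operation:<9} {speedup(before, after):7.1f}x faster")
```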
>
> The test kernel also changes the discard_max_bytes to the underlying hardware
> limit:
>
> $ cat /sys/block/md0/queue/discard_max_bytes
> 2199023255040
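
For anyone wanting to compare these limits across several devices, a minimal Python sketch that reads the same sysfs files as the cat commands above. The function name read_discard_limits and the sysfs_root parameter are assumptions made for illustration; only the two sysfs attributes themselves come from this thread.

```python
from pathlib import Path


def read_discard_limits(device, sysfs_root="/sys/block"):
    """Read the discard limits of a block device from sysfs.

    Returns a dict mapping attribute name to its value in bytes, or to
    None when the attribute file does not exist for that device.
    """
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ("discard_max_bytes", "discard_max_hw_bytes"):
        path = queue / name
        limits[name] = int(path.read_text().strip()) if path.exists() else None
    return limits


if __name__ == "__main__":
    # Device names taken from the testcase; adjust for the local machine.
    for dev in ("md0", "nvme0n1"):
        print(dev, read_discard_limits(dev))
```

On an unpatched kernel, md0 would be expected to report 524288 here, and the underlying hardware limit once the patches are applied.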
>
> [Regression Potential]
>
> If a regression were to occur, then it would affect operations which would
> trigger block discard operations, such as mkfs and fstrim, on Raid10 only.
>
> Other Raid levels would not be affected, although there is a small risk of
> regression to Raid0, because one of its functions was refactored and split out
> for use in both Raid0 and Raid10.
>
> The changes only affect block discard, so only Raid10 arrays backed by SSD or
> NVMe devices which support block discard will be affected. Traditional hard
> disks, or SSD devices which do not support block discard would not be affected.
>
> If a regression were to occur, users could work around the issue by running
> "mkfs.xfs -K <device>" which would skip block discard entirely.
>
> Mike Snitzer (2):
> dm raid: fix discard limits for raid1 and raid10
> dm raid: remove unnecessary discard limits for raid10
>
> Xiao Ni (5):
> md: add md_submit_discard_bio() for submitting discard bio
> md/raid10: extend r10bio devs to raid disks
> md/raid10: pull codes that wait for blocked dev into one function
> md/raid10: improve raid10 discard request
> md/raid10: improve discard request for far layout
>
> drivers/md/dm-raid.c | 9 -
> drivers/md/md.c | 20 ++
> drivers/md/md.h | 2 +
> drivers/md/raid0.c | 14 +-
> drivers/md/raid10.c | 423 +++++++++++++++++++++++++++++++++++++------
> drivers/md/raid10.h | 1 +
> 6 files changed, 391 insertions(+), 78 deletions(-)
>
> --
> 2.27.0
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team