[SRU][B][F][G][PATCH 0/7] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Thu Oct 29 03:07:27 UTC 2020

BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, so common operations that invoke block
discard, such as mkfs and fstrim, take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
which support block discard, a mkfs.xfs operation on Raid10 takes between 8 and
11 minutes, whereas the same mkfs.xfs operation on Raid0 takes 4 seconds.

The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and also uses this as
its discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
request into millions of 512k bio requests, even if the underlying device
supports larger requests.
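
To put a rough number on this, the bio count is the discard range divided by
discard_max_bytes, rounded up. This is a sketch only: discard_bio_count is a
hypothetical helper, 1.9TB is assumed to mean 1900000000000 bytes, and the
calculation ignores alignment and any further splitting inside the kernel.

```shell
# Hypothetical helper: number of discard bios needed to cover a byte range,
# given the queue's discard_max_bytes limit (integer division, rounded up).
discard_bio_count() {
    range_bytes=$1
    max_bytes=$2
    echo $(( (range_bytes + max_bytes - 1) / max_bytes ))
}

# One 1.9TB device with the 512k Raid10 limit.
discard_bio_count 1900000000000 524288

# The same range with the NVMe hardware limit.
discard_bio_count 1900000000000 2199023255040
```

With the 512k limit the count is in the millions, and with the hardware limit
the whole range fits in a single request.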

For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040 

Whereas the Raid10 md device only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288 
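
To compare the array's limit against its member devices in one go, the sysfs
files shown above can be read in a loop. This is a sketch only:
show_discard_limits is a hypothetical helper, and the device names are the ones
from this particular instance.

```shell
# Hypothetical helper: print discard_max_bytes for each named block device.
# The first argument is the sysfs root (normally /sys), the rest are devices.
show_discard_limits() {
    sysfs_root=$1
    shift
    for dev in "$@"; do
        f="$sysfs_root/block/$dev/queue/discard_max_bytes"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$dev" "$(cat "$f")"
        else
            # Device not present on this machine, or no queue directory.
            printf '%s unavailable\n' "$dev"
        fi
    done
}

show_discard_limits /sys md0 nvme0n1 nvme1n1 nvme2n1 nvme3n1
```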

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes,
and if we examine the stack of the process, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 

[Fix]

Xiao Ni has developed a patchset which resolves the block discard performance 
problems. These commits have now landed in 5.10-rc1.

commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
Author: Xiao Ni <xni at redhat.com>
Date: Tue Aug 25 13:42:59 2020 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0

commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
Author: Xiao Ni <xni at redhat.com>
Date: Tue Aug 25 13:43:00 2020 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3

commit f046f5d0d79cdb968f219ce249e497fd1accf484
Author: Xiao Ni <xni at redhat.com>
Date: Tue Aug 25 13:43:01 2020 +0800
Subject: md/raid10: pull codes that wait for blocked dev into one function
Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484

commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
Author: Xiao Ni <xni at redhat.com>
Date: Wed Sep 2 20:00:22 2020 +0800
Subject: md/raid10: improve raid10 discard request
Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9

commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
Author: Xiao Ni <xni at redhat.com>
Date: Wed Sep 2 20:00:23 2020 +0800
Subject: md/raid10: improve discard request for far layout
Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359

There are also some additional commits which are required, and which were merged
after "md/raid10: improve raid10 discard request". The following commits
enable Raid10 to use large discards, instead of splitting into many bios,
since the technical hurdles have now been removed.

commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
Author: Mike Snitzer <snitzer at redhat.com>
Date: Thu Sep 24 13:14:52 2020 -0400
Subject: dm raid: fix discard limits for raid1 and raid10
Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512

commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
Author: Mike Snitzer <snitzer at redhat.com>
Date: Thu Sep 24 16:40:12 2020 -0400
Subject: dm raid: remove unnecessary discard limits for raid10
Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28

All the commits mentioned follow a strategy similar to the one implemented for
Raid0 in the commit below, which was merged in 4.12-rc2 and fixed block
discard performance issues in Raid0:

commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
Author: Shaohua Li <shli at fb.com>
Date: Sun May 7 17:36:24 2017 -0700
Subject: md/md0: optimize raid0 discard handling
Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 

The commits cherry-pick almost cleanly to the 5.8, 5.4 and 4.15 kernels, with the
following minor fixups:

1) submit_bio_noacct() needed to be renamed to generic_make_request() since it
was recently renamed upstream in:

commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
Author: Christoph Hellwig <hch at lst.de>
Date:   Wed Jul 1 10:59:44 2020 +0200
Subject: block: rename generic_make_request to submit_bio_noacct
Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead

2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
'&' removed for one of their arguments for the 4.15 kernel, due to changes made
in:

commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
Author: Kent Overstreet <kent.overstreet at gmail.com>
Date:   Sun May 20 18:25:52 2018 -0400
Subject: md: convert to bioset_init()/mempool_init()
Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8

3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10"
and "dm raid: remove unnecessary discard limits for raid10" due to not having
the following commit, which was merged in 5.1-rc1:

commit 61697a6abd24acba941359c6268a94f4afe4a53d
Author: Mike Snitzer <snitzer at redhat.com>
Date:   Fri Jan 18 14:19:26 2019 -0500
Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d

4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to 
bio_clone_blkcg_association() due to it changing in:

commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
Author: Dennis Zhou <dennis at kernel.org>
Date:   Wed Dec 5 12:10:35 2018 -0500
Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
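
The two function renames in fixups 1) and 4) are mechanical substitutions. As
an illustration only, they could be expressed as a sed filter. backport_renames
is a hypothetical helper and is not how the backports were actually prepared.
Fixup 4) applies to 4.15 only, and fixup 2) (dropping the '&') is not covered
here since it depends on the surrounding code.

```shell
# Hypothetical filter: rewrite the upstream function names to the names
# used by the older kernels. Reads C source on stdin, writes to stdout.
backport_renames() {
    sed -e 's/submit_bio_noacct(/generic_make_request(/g' \
        -e 's/bio_clone_blkg_association(/bio_clone_blkcg_association(/g'
}

# Made-up input line, used only to show the substitution.
printf 'submit_bio_noacct(bio);\n' | backport_renames
```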

[Testcase]

You will need a machine with at least 4x NVMe drives which support block discard.
I use an i3.8xlarge instance on AWS, since it meets these requirements.

$ lsblk 
xvda    202:0    0    8G  0 disk 
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:2    0  1.7T  0 disk 
nvme1n1 259:0    0  1.7T  0 disk 
nvme2n1 259:1    0  1.7T  0 disk 
nvme3n1 259:3    0  1.7T  0 disk

Create a Raid10 array:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Format the array with XFS:

$ time sudo mkfs.xfs /dev/md0
real 11m14.734s 

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk

Optionally, run fstrim:

$ time sudo fstrim /mnt/disk

real	11m37.643s 
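
The steps above can be collected into one script. This is a sketch only: the
run wrapper, the DRY_RUN variable and raid10_discard_test are hypothetical
names, and mkdir uses -p so that a rerun does not fail. By default it only
prints the commands, since running them destroys all data on the listed
devices.

```shell
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: raid10_discard_test <md device> <mount point> <member devices...>
raid10_discard_test() {
    md=$1
    mnt=$2
    shift 2
    run sudo mdadm --create --verbose "$md" --level=10 --raid-devices=$# "$@"
    run sudo mkfs.xfs "$md"
    run sudo mkdir -p "$mnt"
    run sudo mount "$md" "$mnt"
    run sudo fstrim "$mnt"
}

raid10_discard_test /dev/md0 /mnt/disk \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```

To collect timings, the mkfs.xfs and fstrim steps can be run by hand under
"time", as in the transcript above.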

There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:

https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test

With a test kernel installed, performance improves dramatically:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

$ time sudo mkfs.xfs /dev/md0
real	0m4.226s
user	0m0.020s
sys	0m0.148s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real	0m1.991s
user	0m0.020s
sys	0m0.000s

The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
from 11 minutes to 2 seconds.
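
The speedup factor can be worked out from the "real" values reported by time.
This is a sketch only: to_seconds and speedup are hypothetical helpers, and
they assume durations in the XmY.YYYs form shown above.

```shell
# Convert a time-style duration such as 11m14.734s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of the "before" duration to the "after" duration, rounded.
speedup() {
    awk -v before="$(to_seconds "$1")" -v after="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", before / after }'
}

# mkfs.xfs, stock kernel vs test kernel.
speedup 11m14.734s 0m4.226s

# fstrim, stock kernel vs test kernel.
speedup 11m37.643s 0m1.991s
```

For the runs above this works out to roughly 160x for mkfs.xfs and roughly
350x for fstrim.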

Performance Matrix (AWS i3.8xlarge):

Kernel    | mkfs.xfs  | fstrim
---------------------------------
4.15      | 7m23.449s | 7m20.678s
5.4       | 8m23.219s | 8m23.927s
5.8       | 2m54.990s | 8m22.010s
4.15-test | 0m4.286s  | 0m1.657s
5.4-test  | 0m6.075s  | 0m3.150s
5.8-test  | 0m2.753s  | 0m2.999s

The test kernel also changes the discard_max_bytes to the underlying hardware
limit:

$ cat /sys/block/md0/queue/discard_max_bytes 
2199023255040

[Regression Potential]

If a regression were to occur, then it would affect operations which trigger
block discard, such as mkfs and fstrim, on Raid10 only.

Other Raid levels would not be affected, although there is a small risk of
regression to Raid0, since one of its functions has been refactored and split
out for use by both Raid0 and Raid10.

The changes only affect block discard, so only Raid10 arrays backed by SSD or
NVMe devices which support block discard will be affected. Traditional hard
disks, or SSD devices which do not support block discard would not be affected.

If a regression were to occur, users could work around the issue by running
"mkfs.xfs -K <device>" which would skip block discard entirely.

Mike Snitzer (2):
  dm raid: fix discard limits for raid1 and raid10
  dm raid: remove unnecessary discard limits for raid10

Xiao Ni (5):
  md: add md_submit_discard_bio() for submitting discard bio
  md/raid10: extend r10bio devs to raid disks
  md/raid10: pull codes that wait for blocked dev into one function
  md/raid10: improve raid10 discard request
  md/raid10: improve discard request for far layout

 drivers/md/dm-raid.c |   9 -
 drivers/md/md.c      |  20 ++
 drivers/md/md.h      |   2 +
 drivers/md/raid0.c   |  14 +-
 drivers/md/raid10.c  | 423 +++++++++++++++++++++++++++++++++++++------
 drivers/md/raid10.h  |   1 +
 6 files changed, 391 insertions(+), 78 deletions(-)

-- 
2.27.0