[SRU][B][F][G][H][PATCH 0/6] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Matthew Ruffell matthew.ruffell at canonical.com
Mon May 10 01:25:48 UTC 2021


Hi Tim,

I appreciate your cautiousness, and I too am happy to play the conservative game
on this particular patchset. I _really_ don't want to cause another regression;
the one in December caused enough trouble for everyone as it was.

I'm happy to NACK the patchset for this cycle and resubmit in the next cycle,
as long as we set out and agree on what additional regression testing should be
performed, and that if I go and perform those tests you will seriously consider
accepting these patches.

Testing that I have currently performed:
- Testing against the testcase of mkfs.xfs / mkfs.ext4 / fstrim on a Raid10 md
  block device, as specified in LP #1896578 [1].
- Testing against the disk corruption regression testcase, as reported in
  LP #1907262 [2]. All disks now fsck clean with this new revision of the 
  patchset, which you can see in comment #15 [3] on LP #1896578. 
- Three customers have tested test kernels and haven't found any issues (yet).
  
Testing that I could go perform over the next week or two:
- Run xfstests with the generic/*, xfs/* and ext4/* testsuites over two Raid10
  md block devices, one as the test block device and one as the scratch block
  device. I would do two runs, one with a released kernel without the patches,
  and one with the test kernels in [4], to see if there are any regressions.
- Use a cloud instance with NVMe drives as my primary computer over the next
  two or so weeks, with my /home on a Raid10 md block device and the fstrim
  systemd timer changed from weekly to every 30 minutes, to see if I come
  across any data corruption. A rough sketch of both of these is below.
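
To make that concrete, here is a rough sketch of what I have in mind. The second
set of NVMe device names, the xfstests configuration and the timer override are
illustrative assumptions rather than a finalised plan:

$ sudo mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
$ sudo mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
$ cd xfstests-dev && cat > local.config << EOF
FSTYP=xfs
TEST_DEV=/dev/md0
TEST_DIR=/mnt/test
SCRATCH_DEV=/dev/md1
SCRATCH_MNT=/mnt/scratch
EOF
$ sudo ./check -g auto    # covers generic/* plus xfs/*; repeat with FSTYP=ext4 for ext4/*

$ sudo systemctl edit fstrim.timer    # then add the override:
[Timer]
OnCalendar=
OnCalendar=*:0/30
$ sudo systemctl restart fstrim.timer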

Just so that you are aware, I have three different customers who would very much
like these patches to land in the Ubuntu kernels. Two deploy systems via MAAS or
curtin, and are seeing deployment timeouts when deploying Raid10 onto NVMe disks:
with current kernels their larger arrays take 2-3 hours to complete a block
discard, while a normal system deploys in 15-30 minutes, and they don't want to
increase the deployment timeout setting for this one outlier. The third customer
doesn't want to spend 2-3 hours waiting for their Raid10 md arrays to be
formatted with a filesystem when it could take 4 seconds instead.

I know this is a big change, and I know that this set has already caused grief
with a regression in December, but customers are requesting this feature.
Because of that, I'm willing to work with you to figure out the appropriate
testing required, and hopefully get this landed safely in the Ubuntu kernels in
the near future.

Let me know what additional testing you would like to see, and I will go and
complete it.

Thanks,
Matthew

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262
[3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/comments/15
[4] https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test

On 8/05/21 1:28 am, Tim Gardner wrote:
> This is way more code change than I can wrap my head around. Aside from the test cases that exercise block discard, have you run any other file system tests on top of a raid 10 configuration? These changes are not isolated to just raid 10, so perhaps some testing on other raid configurations would be appropriate. I know you've already put a lot of work into this, but it's a huge change. I'd also be inclined to let the upstream changes bake a while, then monitor for subsequent fixes.
> 
> This isn't a NACK per se, but I'm in no hurry to ACK just yet.
> 
> rtg
> 
> On 5/5/21 10:04 PM, Matthew Ruffell wrote:
>> BugLink: https://bugs.launchpad.net/bugs/1896578
>>
>> [Impact]
>>
>> Block discard is very slow on Raid10, which causes common use cases that invoke
>> block discard, such as mkfs and fstrim operations, to take a very long time.
>>
>> For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
>> that support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and
>> 11 minutes, whereas the same mkfs.xfs operation on Raid 0 takes 4 seconds.
>>
>> The bigger the devices, the longer it takes.
>>
>> The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
>> discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
>> request into millions of 512k bio requests, even if the underlying device
>> supports larger requests.
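>>
>> To put that in perspective, discarding 1.9TB in 512k chunks works out to roughly
>> 1.9TB / 512KiB ≈ 3.6 million separate bio requests.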
>>
>> For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
>>
>> $ cat /sys/block/nvme0n1/queue/discard_max_bytes
>> 2199023255040
>> $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
>> 2199023255040
>>
>> Where the Raid10 md device only supports 512k:
>>
>> $ cat /sys/block/md0/queue/discard_max_bytes
>> 524288
>> $ cat /sys/block/md0/queue/discard_max_hw_bytes
>> 524288
>>
>> If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11
>> minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():
>>
>> $ sudo cat /proc/1626/stack
>> [<0>] wait_barrier+0x14c/0x230 [raid10]
>> [<0>] regular_request_wait+0x39/0x150 [raid10]
>> [<0>] raid10_write_request+0x11e/0x850 [raid10]
>> [<0>] raid10_make_request+0xd7/0x150 [raid10]
>> [<0>] md_handle_request+0x123/0x1a0
>> [<0>] md_submit_bio+0xda/0x120
>> [<0>] __submit_bio_noacct+0xde/0x320
>> [<0>] submit_bio_noacct+0x4d/0x90
>> [<0>] submit_bio+0x4f/0x1b0
>> [<0>] __blkdev_issue_discard+0x154/0x290
>> [<0>] blkdev_issue_discard+0x5d/0xc0
>> [<0>] blk_ioctl_discard+0xc4/0x110
>> [<0>] blkdev_common_ioctl+0x56c/0x840
>> [<0>] blkdev_ioctl+0xeb/0x270
>> [<0>] block_ioctl+0x3d/0x50
>> [<0>] __x64_sys_ioctl+0x91/0xc0
>> [<0>] do_syscall_64+0x38/0x90
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> [Fix]
>>
>> Xiao Ni has developed a patchset which resolves the block discard performance
>> problems. These commits have now landed in 5.13-rc1.
>>
>> commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8
>> Author: Xiao Ni <xni at redhat.com>
>> Date: Thu Feb 4 15:50:43 2021 +0800
>> Subject: md: add md_submit_discard_bio() for submitting discard bio
>> Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8
>>
>> commit c2968285925adb97b9aa4ede94c1f1ab61ce0925
>> Author: Xiao Ni <xni at redhat.com>
>> Date: Thu Feb 4 15:50:44 2021 +0800
>> Subject: md/raid10: extend r10bio devs to raid disks
>> Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925
>>
>> commit f2e7e269a7525317752d472bb48a549780e87d22
>> Author: Xiao Ni <xni at redhat.com>
>> Date: Thu Feb 4 15:50:45 2021 +0800
>> Subject: md/raid10: pull the code that wait for blocked dev into one function
>> Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22
>>
>> commit d30588b2731fb01e1616cf16c3fe79a1443e29aa
>> Author: Xiao Ni <xni at redhat.com>
>> Date: Thu Feb 4 15:50:46 2021 +0800
>> Subject: md/raid10: improve raid10 discard request
>> Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa
>>
>> commit 254c271da0712ea8914f187588e0f81f7678ee2f
>> Author: Xiao Ni <xni at redhat.com>
>> Date: Thu Feb 4 15:50:47 2021 +0800
>> Subject: md/raid10: improve discard request for far layout
>> Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f
>>
>> There is also an additional commit which is required, merged after
>> "md/raid10: improve raid10 discard request". The following commit enables
>> Raid10 to use large discards, instead of splitting them into many bios, since
>> the technical hurdles have now been removed.
>>
>> commit ca4a4e9a55beeb138bb06e3867f5e486da896d44
>> Author: Mike Snitzer <snitzer at redhat.com>
>> Date: Fri Apr 30 14:38:37 2021 -0400
>> Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
>> Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44
>>
>> The commits cherry-pick more or less cleanly to the 5.11, 5.8, 5.4 and 4.15
>> kernels, with the following minor backport changes:
>>
>> 1) submit_bio_noacct() needed to be changed back to generic_make_request(),
>> since the function was only recently renamed in:
>>
>> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
>> Author: Christoph Hellwig <hch at lst.de>
>> Date: Wed Jul 1 10:59:44 2020 +0200
>> Subject: block: rename generic_make_request to submit_bio_noacct
>> Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
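>>
>> (In practice this just means that calls like submit_bio_noacct(bio) in the
>> upstream patches become generic_make_request(bio) in the backports for the
>> older kernels.)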
>>
>> 2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its
>> request_queue argument put back in place. It was recently removed in:
>>
>> commit 1c02fca620f7273b597591065d366e2cca948d8f
>> Author: Christoph Hellwig <hch at lst.de>
>> Date: Thu Dec 3 17:21:38 2020 +0100
>> Subject: block: remove the request_queue argument to the block_bio_remap tracepoint
>> Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f
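>>
>> (That is, roughly speaking, where upstream now calls
>> trace_block_bio_remap(bio, dev, sector), those kernels still expect
>> trace_block_bio_remap(q, bio, dev, sector).)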
>>
>> 3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
>> '&' removed for one of their arguments for the 4.15 kernel, due to changes made
>> in:
>>
>> commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
>> Author: Kent Overstreet <kent.overstreet at gmail.com>
>> Date: Sun May 20 18:25:52 2018 -0400
>> Subject: md: convert to bioset_init()/mempool_init()
>> Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
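>>
>> (For example, an argument written as &conf->bio_split upstream becomes
>> conf->bio_split in the 4.15 backport, since those bio_sets and mempools are
>> still pointers rather than embedded structs there.)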
>>
>> 4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for
>> raid0 and raid10" due to not having the following commit, which was merged in
>> 5.1-rc1:
>>
>> commit 61697a6abd24acba941359c6268a94f4afe4a53d
>> Author: Mike Snitzer <snitzer at redhat.com>
>> Date: Fri Jan 18 14:19:26 2019 -0500
>> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
>> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
>>
>> 5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to
>> bio_clone_blkcg_association() due to it changing in:
>>
>> commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
>> Author: Dennis Zhou <dennis at kernel.org>
>> Date: Wed Dec 5 12:10:35 2018 -0500
>> Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
>> Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
>>
>> [Testcase]
>>
>> You will need a machine with at least 4x NVMe drives which support block
>> discard. I use an i3.8xlarge instance on AWS, since it has all of these things.
>>
>> $ lsblk
>> xvda 202:0 0 8G 0 disk
>> └─xvda1 202:1 0 8G 0 part /
>> nvme0n1 259:2 0 1.7T 0 disk
>> nvme1n1 259:0 0 1.7T 0 disk
>> nvme2n1 259:1 0 1.7T 0 disk
>> nvme3n1 259:3 0 1.7T 0 disk
>>
>> Create a Raid10 array:
>>
>> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Format the array with XFS:
>>
>> $ time sudo mkfs.xfs /dev/md0
>> real 11m14.734s
>>
>> $ sudo mkdir /mnt/disk
>> $ sudo mount /dev/md0 /mnt/disk
>>
>> Optional, do a fstrim:
>>
>> $ time sudo fstrim /mnt/disk
>>
>> real 11m37.643s
>>
>> There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
>>
>> https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test
>>
>> With a test kernel installed, we can see that performance improves dramatically:
>>
>> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> $ time sudo mkfs.xfs /dev/md0
>> real 0m4.226s
>> user 0m0.020s
>> sys 0m0.148s
>>
>> $ sudo mkdir /mnt/disk
>> $ sudo mount /dev/md0 /mnt/disk
>> $ time sudo fstrim /mnt/disk
>>
>> real 0m1.991s
>> user 0m0.020s
>> sys 0m0.000s
>>
>> The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
>> from 11 minutes to 2 seconds.
>>
>> Performance Matrix (AWS i3.8xlarge):
>>
>> Kernel    | mkfs.xfs  | fstrim
>> ----------|-----------|----------
>> 4.15      | 7m23.449s | 7m20.678s
>> 5.4       | 8m23.219s | 8m23.927s
>> 5.8       | 2m54.990s | 8m22.010s
>> 4.15-test | 0m4.286s  | 0m1.657s
>> 5.4-test  | 0m6.075s  | 0m3.150s
>> 5.8-test  | 0m2.753s  | 0m2.999s
>>
>> The test kernel also changes the discard_max_bytes to the underlying hardware
>> limit:
>>
>> $ cat /sys/block/md0/queue/discard_max_bytes
>> 2199023255040
>>
>> [Where problems can occur]
>>
>> A problem has occurred once before, with the previous revision of this patchset.
>> It is documented in bug 1907262, and in the worst case caused data loss for some
>> users, in this particular case on the second and subsequent disks. This was due
>> to two faults: the first was an incorrectly calculated start offset for block
>> discard on the second and subsequent disks; the second was an incorrect stripe
>> size for far layouts.
>>
>> The kernel team was forced to revert the patches in an emergency, the faulty
>> kernel was removed from the archive, and community users were urged to avoid
>> the faulty kernel.
>>
>> These bugs and a few other minor issues have now been corrected, and we have
>> been testing the new patches since mid February. The patches have been tested
>> against the testcase in bug 1907262 and do not cause the disks to become
>> corrupted.
>>
>> The regression potential is still the same for this patchset though. If a
>> regression were to occur, it could lead to data loss on Raid10 arrays backed by
>> NVMe or SSD disks that support block discard.
>>
>> If a regression happens, users need to disable the fstrim systemd timer as soon
>> as possible, plan an emergency maintenance window, and either downgrade the
>> kernel to a previous release or upgrade to a corrected kernel.
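>>
>> As an example of that first step (assuming the periodic trim is driven by the
>> standard fstrim.timer unit), disabling it would look something like:
>>
>> $ sudo systemctl disable --now fstrim.timer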
>>
>> Mike Snitzer (1):
>>    dm raid: remove unnecessary discard limits for raid0 and raid10
>>
>> Xiao Ni (5):
>>    md: add md_submit_discard_bio() for submitting discard bio
>>    md/raid10: extend r10bio devs to raid disks
>>    md/raid10: pull the code that wait for blocked dev into one function
>>    md/raid10: improve raid10 discard request
>>    md/raid10: improve discard request for far layout
>>
>>   drivers/md/dm-raid.c |   9 -
>>   drivers/md/md.c      |  20 ++
>>   drivers/md/md.h      |   2 +
>>   drivers/md/raid0.c   |  14 +-
>>   drivers/md/raid10.c  | 434 +++++++++++++++++++++++++++++++++++++------
>>   drivers/md/raid10.h  |   1 +
>>   6 files changed, 402 insertions(+), 78 deletions(-)
>>
> 


