ACK: [SRU][Xenial][PATCH 0/1] Performance degradation when copying from LVM snapshot backed by NVMe disk
Colin Ian King
colin.king at canonical.com
Thu Jun 20 15:08:09 UTC 2019
On 19/06/2019 05:34, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1833319
>
> [Impact]
> When copying files from a mounted LVM snapshot which resides on NVMe storage
> devices, there is a massive performance degradation in the rate sectors are
> read from the disk.
>
> The kernel is not merging sector requests, and instead issues many small
> requests to the NVMe storage controller rather than one larger request.
>
> Experiments have shown a 14x-25x performance degradation in reads: copies
> which used to take seconds now take minutes, and copies which took thirty
> minutes now take many hours.
>
> The following was found with btrace running alongside cat (see the [Testcase]
> section below):
>
> A = IO remapped to different device
> Q = IO handled by request queue
> G = Get request
> U = Unplug request
> I = IO inserted onto request queue
> D = IO issued to driver
> C = IO completion
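>
> For reference, a trace like the ones below can be captured with the btrace
> wrapper from the blktrace package while the copy runs; the device name and
> output path here are just the ones used in this reproducer:
>
> $ sudo btrace /dev/nvme1n1 > /tmp/nvme1n1.trace &
> $ cat $NEWMOUNT/dummy1 1> /dev/null
>
> where $NEWMOUNT is the snapshot mount point from the [Testcase] section.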
>
> When reading from the LVM snapshot, we see:
>
> 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824
> 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat]
> 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat]
> 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832
> 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat]
> 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat]
> 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat]
> 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat]
> 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat]
> ...
> 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H]
> 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H]
> 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H]
> ...
> 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0]
>
> What is happening here is that each 8-sector read request is placed onto the
> IO request queue, then inserted into the driver request queue and fetched by
> the driver one at a time, without ever being merged.
>
> Comparing this behaviour to reading data from an LVM snapshot on 4.6+ mainline
> or the Ubuntu 4.15 HWE kernel:
>
> M = IO back merged with request on queue
>
> 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912
> 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat]
> 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat]
> 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920
> 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat]
> 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat]
> 259,0 0 200 0.000534474 1897 UT N [cat] 1
> 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat]
> 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat]
> 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0]
>
> This shows an 8-sector read being added to the request queue and then
> back-[M]erged with other requests on the queue until the sum of the merged
> requests reaches 512 sectors. From there, a single 512-sector read is placed
> onto the IO queue, fetched by the device driver, and completed.
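>
> A quick way to compare the two behaviours from a saved trace is to count the
> merge events, for example:
>
> $ grep -c ' M ' /tmp/nvme1n1.trace
>
> which stays at roughly zero on an affected 4.4 kernel and becomes large once
> back-merging works (the trace path is just the example used above).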
>
> [Fix]
>
> The problem is that the NVMe driver in the xenial 4.4 kernel does not merge
> these small 8-sector requests.
>
> Merging is controlled per device by this sysfs entry:
> /sys/block/nvme1n1/queue/nomerges
>
> On xenial 4.4, reading from this yields 2 (QUEUE_FLAG_NOMERGES, merging
> disabled). On 4.5+ mainline and the 4.15 HWE kernel, it yields 0 (merging
> allowed).
>
> Setting this to 0 on the 4.4 kernel with:
>
> # echo "0" > /sys/block/nvme1n1/queue/nomerges
>
> and testing again, we find performance is restored and the problem is fixed.
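>
> As a stopgap on unpatched kernels, the same setting could be applied at boot,
> for example from /etc/rc.local or a similar boot-time script; this is only an
> illustrative workaround, not part of the proposed fix:
>
> for q in /sys/block/nvme*/queue/nomerges; do
>     echo 0 > "$q"    # allow request merging on every NVMe namespace
> done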
>
> Performing a btrace again, we see the 8-sector reads get back-merged into a
> single 512-sector read which is issued in one go.
>
> The problem was fixed in 4.5 upstream with the below commit:
>
> commit ef2d4615c59efb312e531a5e949970f37ca1c841
> Author: Keith Busch <keith.busch at intel.com>
> Date: Thu Feb 11 13:05:40 2016 -0700
> Subject: NVMe: Allow request merges
>
> This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver
> init, allowing requests to be back-merged. It also has the direct effect of
> defaulting /sys/block/nvme1n1/queue/nomerges to 0.
>
> Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4
> kernels.
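>
> For reference, the pick applies along these lines (the remote and branch used
> here are illustrative, not the actual kernel team workflow):
>
> $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> $ git cherry-pick -x ef2d4615c59efb312e531a5e949970f37ca1c841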
>
> [Testcase]
>
> You can replicate the problem on a system with an NVMe disk. I recommend using
> a c5.large AWS EC2 instance with a secondary gp2 EBS disk of 200GB or larger.
>
> Steps (with NVMe disk being /dev/nvme1n1):
> 1. sudo pvcreate /dev/nvme1n1
> 2. sudo vgcreate secvol /dev/nvme1n1
> 3. sudo lvcreate --name seclv -l 80%FREE secvol
> 4. sudo mkfs.ext4 /dev/secvol/seclv
> 5. sudo mount /dev/mapper/secvol-seclv /mnt
> 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done
> 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE'
> 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX)
> 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT
>
> To replicate, simply read one of those 512MB files:
> 10. time cat $NEWMOUNT/dummy1 1> /dev/null
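>
> While the cat is running, the merging behaviour can also be watched with
> something like the following (iostat is from the sysstat package; device name
> as above):
>
> $ iostat -x nvme1n1 1
>
> where the average request size column stays around 8 sectors on an affected
> kernel and grows towards 512 sectors once merging works.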
>
> On a stock xenial kernel, expect to see the following:
>
> 4.4.0-151-generic #178-Ubuntu
>
> $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null
>
> real 0m42.693s
> user 0m0.008s
> sys 0m0.388s
> $ cat /sys/block/nvme1n1/queue/nomerges
> 2
>
> On a patched xenial kernel, performance is restored:
>
> 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu
>
> $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null
>
> real 0m1.773s
> user 0m0.008s
> sys 0m0.184s
> $ cat /sys/block/nvme1n1/queue/nomerges
> 0
>
> [Regression Potential]
>
> Cherry-picking "NVMe: Allow request merges" changes the default request merge
> policy for NVMe drives, which may give some cause for concern in terms of both
> stability and performance for other workloads.
>
> Regarding stability, this flag was originally set when the NVMe driver was
> bio-based, before the driver was converted to blk-mq and split out of
> drivers/block. You can read a mailing list thread about it here:
>
> https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html
>
> Together with the commit "MD: make bio mergeable", there is no reason not to
> allow requests to be merged for the new NVMe driver.
>
> Regarding performance for other workloads, I reference the commit in which
> QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced:
> commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c
> subject: block: Added in stricter no merge semantics for block I/O
>
> nomerges Throughput %System Improvement (tput / %sys)
> -------- ------------ ----------- -------------------------
> 0 12.45 MB/sec 0.669365609
> 1 12.50 MB/sec 0.641519199 0.40% / 2.71%
> 2 12.52 MB/sec 0.639849750 0.56% / 2.96%
>
> It shows only a 0.56% throughput increase for strict no-merging (nomerges == 2)
> over allowing merging (nomerges == 0) for random IO workloads.
>
> Comparing this with the 14x-25x performance degradation for reads whose
> requests cannot be merged, it is clear that changing the default to 0 will not
> impact other workloads by any significant margin.
>
> The commit landed in Linux 4.5 mainline, cherry-picks cleanly, and is still
> present in the kernel to this day. After reviewing the NVMe driver, I believe
> there will be no regressions.
>
> If you are interested in testing, I have prepared two PPAs with
> ef2d4615c59efb312e531a5e949970f37ca1c841 patched:
>
> linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic
> linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
>
> Keith Busch (1):
> NVMe: Allow request merges
>
> drivers/nvme/host/core.c | 1 -
> 1 file changed, 1 deletion(-)
>
Clean cherry pick; we are carrying this in Bionic+. It's a small fix and
has good test results. Seems good to me. Thanks, Matthew.
Acked-by: Colin Ian King <colin.king at canonical.com>