ACK: [SRU][Xenial][PATCH 0/1] Performance degradation when copying from LVM snapshot backed by NVMe disk
Colin Ian King
colin.king at canonical.com
Thu Jun 20 15:08:09 UTC 2019
On 19/06/2019 05:34, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1833319
>
> [Impact]
> When copying files from a mounted LVM snapshot which resides on NVMe storage
> devices, there is a massive performance degradation in the rate sectors are
> read from the disk.
>
> The kernel is not merging sector requests, and instead issues many small
> requests to the NVMe storage controller rather than one larger request.
>
> Experiments have shown a 14x-25x performance degradation in reads: copies
> which used to take seconds now take minutes, and copies which took thirty
> minutes now take many hours.
>
> The following was found with btrace running alongside cat (see the [Testcase]
> section below):
>
> A = IO remapped to different device
> Q = IO handled by request queue
> G = Get request
> U = Unplug request
> I = IO inserted onto request queue
> D = IO issued to driver
> C = IO completion
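>
> For reference, a trace like the ones below can be captured with the btrace
> wrapper from the blktrace package while the copy runs; the device name and
> output path here are just the ones used in this reproducer:
>
> $ sudo btrace /dev/nvme1n1 > /tmp/nvme1n1.trace &
> $ cat $NEWMOUNT/dummy1 1> /dev/null
>
> where $NEWMOUNT is the snapshot mount point from the [Testcase] section.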
>
> When reading from the LVM snapshot, we see:
>
> 259,0 1 113 0.001117160 1606 A R 837872 + 8 <- (252,0) 835824
> 259,0 1 114 0.001117276 1606 Q R 837872 + 8 [cat]
> 259,0 1 115 0.001117451 1606 G R 837872 + 8 [cat]
> 259,0 1 116 0.001117979 1606 A R 837880 + 8 <- (252,0) 835832
> 259,0 1 117 0.001118119 1606 Q R 837880 + 8 [cat]
> 259,0 1 118 0.001118285 1606 G R 837880 + 8 [cat]
> 259,0 1 122 0.001121613 1606 I RS 837640 + 8 [cat]
> 259,0 1 123 0.001121687 1606 I RS 837648 + 8 [cat]
> 259,0 1 124 0.001121758 1606 I RS 837656 + 8 [cat]
> ...
> 259,0 1 154 0.001126118 377 D RS 837648 + 8 [kworker/1:1H]
> 259,0 1 155 0.001126445 377 D RS 837656 + 8 [kworker/1:1H]
> 259,0 1 156 0.001126871 377 D RS 837664 + 8 [kworker/1:1H]
> ...
> 259,0 1 183 0.001848512 0 C RS 837632 + 8 [0]
>
> What is happening here is that each 8-sector read request is placed onto the
> IO request queue, then inserted into the driver request queue and fetched by
> the driver one at a time, without ever being merged.
>
> Comparing this behaviour to reading data from an LVM snapshot on 4.6+ mainline
> or the Ubuntu 4.15 HWE kernel:
>
> M = IO back merged with request on queue
>
> 259,0 0 194 0.000532515 1897 A R 7358960 + 8 <- (253,0) 7356912
> 259,0 0 195 0.000532634 1897 Q R 7358960 + 8 [cat]
> 259,0 0 196 0.000532810 1897 M R 7358960 + 8 [cat]
> 259,0 0 197 0.000533864 1897 A R 7358968 + 8 <- (253,0) 7356920
> 259,0 0 198 0.000533991 1897 Q R 7358968 + 8 [cat]
> 259,0 0 199 0.000534177 1897 M R 7358968 + 8 [cat]
> 259,0 0 200 0.000534474 1897 UT N [cat] 1
> 259,0 0 201 0.000534586 1897 I R 7358464 + 512 [cat]
> 259,0 0 202 0.000537055 1897 D R 7358464 + 512 [cat]
> 259,0 0 203 0.002242539 0 C R 7358464 + 512 [0]
>
> This shows an 8-sector read being added to the request queue and then
> back-[M]erged with other requests on the queue until the sum of the merged
> requests reaches 512 sectors. From there, a single 512-sector read is placed
> onto the IO queue, fetched by the device driver, and completed.
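>
> A quick way to compare the two behaviours from a saved trace is to count the
> merge events, for example:
>
> $ grep -c ' M ' /tmp/nvme1n1.trace
>
> which stays at roughly zero on an affected 4.4 kernel and becomes large once
> back-merging works (the trace path is just the example used above).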
>
> [Fix]
>
> The problem is that the NVMe driver in the xenial 4.4 kernel does not merge
> these small 8-sector requests.
>
> Merging is controlled per device by this sysfs entry:
> /sys/block/nvme1n1/queue/nomerges
>
> On xenial 4.4, reading from this yields 2 (QUEUE_FLAG_NOMERGES, merging
> disabled). On 4.5+ mainline and the 4.15 HWE kernel, it yields 0 (merging
> allowed).
>
> Setting this to 0 on the 4.4 kernel with:
>
> # echo "0" > /sys/block/nvme1n1/queue/nomerges
>
> and testing again, we find performance is restored and the problem is fixed.
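>
> As a stopgap on unpatched kernels, the same setting could be applied at boot,
> for example from /etc/rc.local or a similar boot-time script; this is only an
> illustrative workaround, not part of the proposed fix:
>
> for q in /sys/block/nvme*/queue/nomerges; do
>     echo 0 > "$q"    # allow request merging on every NVMe namespace
> done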
>
> Performing a btrace again, we see the 8-sector reads get back-merged into a
> single 512-sector read which is issued in one go.
>
> The problem was fixed in 4.5 upstream with the below commit:
>
> commit ef2d4615c59efb312e531a5e949970f37ca1c841
> Author: Keith Busch <keith.busch at intel.com>
> Date: Thu Feb 11 13:05:40 2016 -0700
> Subject: NVMe: Allow request merges
>
> This commit stops the QUEUE_FLAG_NOMERGES flag from being set during driver
> init, allowing requests to be back-merged. It also has the direct effect of
> defaulting /sys/block/nvme1n1/queue/nomerges to 0.
>
> Please cherry-pick ef2d4615c59efb312e531a5e949970f37ca1c841 to all xenial 4.4
> kernels.
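>
> For reference, the pick applies along these lines (the remote and branch used
> here are illustrative, not the actual kernel team workflow):
>
> $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> $ git cherry-pick -x ef2d4615c59efb312e531a5e949970f37ca1c841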
>
> [Testcase]
>
> You can replicate the problem on a system with an NVMe disk. I recommend using
> a c5.large AWS EC2 instance with a secondary gp2 EBS disk of 200GB or larger.
>
> Steps (with NVMe disk being /dev/nvme1n1):
> 1. sudo pvcreate /dev/nvme1n1
> 2. sudo vgcreate secvol /dev/nvme1n1
> 3. sudo lvcreate --name seclv -l 80%FREE secvol
> 4. sudo mkfs.ext4 /dev/secvol/seclv
> 5. sudo mount /dev/mapper/secvol-seclv /mnt
> 6. for i in `seq 1 20`; do sudo dd if=/dev/zero of=/mnt/dummy$i bs=512M count=1; done
> 7. sudo lvcreate --snapshot /dev/secvol/seclv --name tmp_backup1 --extents '90%FREE'
> 8. NEWMOUNT=$(mktemp -t -d mount.backup_XXX)
> 9. sudo mount -v -o ro /dev/secvol/tmp_backup1 $NEWMOUNT
>
> To replicate, simply read one of those 512MB files:
> 10. time cat $NEWMOUNT/dummy1 1> /dev/null
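>
> While the cat is running, the merging behaviour can also be watched with
> something like the following (iostat is from the sysstat package; device name
> as above):
>
> $ iostat -x nvme1n1 1
>
> where the average request size column stays around 8 sectors on an affected
> kernel and grows towards 512 sectors once merging works.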
>
> On a stock xenial kernel, expect to see the following:
>
> 4.4.0-151-generic #178-Ubuntu
>
> $ time cat /tmp/mount.backup_TYD/dummy1 1> /dev/null
>
> real 0m42.693s
> user 0m0.008s
> sys 0m0.388s
> $ cat /sys/block/nvme1n1/queue/nomerges
> 2
>
> On a patched xenial kernel, performance is restored:
>
> 4.4.0-151-generic #178+hf228435v20190618b1-Ubuntu
>
> $ time cat /tmp/mount.backup_aId/dummy1 1> /dev/null
>
> real 0m1.773s
> user 0m0.008s
> sys 0m0.184s
> $ cat /sys/block/nvme1n1/queue/nomerges
> 0
>
> [Regression Potential]
>
> Cherry-picking "NVMe: Allow request merges" changes the default request merge
> policy for NVMe drives, which may give some cause for concern in terms of both
> stability and performance for other workloads.
>
> Regarding stability, this flag was originally set when the NVMe driver was
> bio-based, before the driver was converted to blk-mq and split out of
> drivers/block. You can read a mailing list thread about it here:
>
> https://lists.infradead.org/pipermail/linux-nvme/2016-February/003946.html
>
> Together with the commit "MD: make bio mergeable", there is no reason not to
> allow requests to be merged for the new NVMe driver.
>
> Regarding performance for other workloads, I reference the commit in which
> QUEUE_FLAG_NOMERGES (nomerges == 2) was introduced:
> commit: 488991e28e55b4fbca8067edf0259f69d1a6f92c
> subject: block: Added in stricter no merge semantics for block I/O
>
> nomerges Throughput %System Improvement (tput / %sys)
> -------- ------------ ----------- -------------------------
> 0 12.45 MB/sec 0.669365609
> 1 12.50 MB/sec 0.641519199 0.40% / 2.71%
> 2 12.52 MB/sec 0.639849750 0.56% / 2.96%
>
> It shows only a 0.56% throughput increase for strict no-merging (nomerges == 2)
> over allowing merging (nomerges == 0) for random IO workloads.
>
> Comparing this with the 14x-25x performance degradation for reads whose
> requests cannot be merged, it is clear that changing the default to 0 will not
> impact other workloads by any significant margin.
>
> The commit landed in Linux 4.5 mainline, cherry-picks cleanly, and is still
> present in the kernel to this day. After reviewing the NVMe driver, I believe
> there will be no regressions.
>
> If you are interested in testing, I have prepared two PPAs with
> ef2d4615c59efb312e531a5e949970f37ca1c841 patched:
>
> linux-image-generic: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test-generic
> linux-image-aws: https://launchpad.net/~mruffell/+archive/ubuntu/sf228435-test
>
> Keith Busch (1):
> NVMe: Allow request merges
>
> drivers/nvme/host/core.c | 1 -
> 1 file changed, 1 deletion(-)
>
Clean cherry pick; we are carrying this in Bionic+. It's a small fix and
has good test results. Seems good to me. Thanks, Matthew.
Acked-by: Colin Ian King <colin.king at canonical.com>