[Bug 2031383] Re: Performance issue on mdraid5 when the number of devices more than 4

Vladimir Khristenko 2031383 at bugs.launchpad.net
Tue Aug 15 12:09:30 UTC 2023


** Summary changed:

- RAID5 performance issue on mdraid5 when the number of devices more than 4
+ Performance issue on mdraid5 when the number of devices more than 4

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to mdadm in Ubuntu.
https://bugs.launchpad.net/bugs/2031383

Title:
  Performance issue on mdraid5 when the number of devices more than 4

Status in mdadm package in Ubuntu:
  New

Bug description:
  Hi there.

  I have encountered a significant increase in the maximum latency of the
  4k random write pattern on mdraid5 when the number of devices in the
  array is more than 4.

  Environment:
  OS: Ubuntu 20.04
  kernel: 5.15.0-79 (HWE)
  NVMe: 5x Solidigm D7-5620 1.6TB (FW: 9CV10410)

  The group_thread_cnt and stripe_cache_size parameters are set via a udev rules file:
  cat /etc/udev/rules.d/60-md-stripe-cache.rules
  SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/group_thread_cnt}="6"
  SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/stripe_cache_size}="512"

  mdraid5 on top of 4x NVMe drives:
  #---------------
  cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
  md0 : active raid5 nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
        4688040960 blocks super 1.2 level 5, 4k chunk, algorithm 2 [4/4] [UUUU]
        bitmap: 0/12 pages [0KB], 65536KB chunk
  #---------------
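
  For completeness, the NVMe array was created along these lines (a reconstruction consistent with the mdstat output above, i.e. 4k chunk and internal bitmap; the exact invocation may have differed slightly):
  mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=4 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1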

  Then I run the fio tests:
  for i in {1..3}; do echo test "$i"; fio --name=nvme --numjobs=8 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0p1 --runtime=600 --time_based=1 --ramp_time=0; done
  fio results:
  Test 1:
  ...
  write: IOPS=250k, BW=976MiB/s (1023MB/s)(572GiB/600002msec);
  lat (usec): min=58, max=9519, avg=1024.02, stdev=1036.23

  Test 2:
  ...
  write: IOPS=291k, BW=1138MiB/s (1193MB/s)(667GiB/600002msec); 0 zone resets
  lat (usec): min=43, max=19160, avg=878.25, stdev=820.79

  Test 3:
  ...
  write: IOPS=301k, BW=1176MiB/s (1233MB/s)(689GiB/600003msec); 0 zone resets
  lat (usec): min=48, max=7900, avg=850.05, stdev=763.24
  ...

  Max latency is 19160 usec (test 2).

  mdraid5 on top of 5x NVMe drives:
  #---------------
  cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
  md0 : active raid5 nvme4n1p1[4] nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
        6250721280 blocks super 1.2 level 5, 4k chunk, algorithm 2 [5/5] [UUUUU]
        bitmap: 10/12 pages [40KB], 65536KB chunk
  #---------------
  Running the same test:
  for i in {1..3}; do echo test "$i"; fio --name=nvme --numjobs=8 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0p1 --runtime=600 --time_based=1 --ramp_time=0; done

  fio results:
  Test 1:
  ...
  write: IOPS=375k, BW=1466MiB/s (1537MB/s)(859GiB/600002msec); 0 zone resets
  lat (usec): min=78, max=28966k, avg=681.56, stdev=3300.12

  Test 2:
  ...
  write: IOPS=390k, BW=1524MiB/s (1598MB/s)(893GiB/600001msec); 0 zone resets
  lat (usec): min=77, max=63847k, avg=655.85, stdev=6565.15
  ...

  Test 3:
  ...
  write: IOPS=391k, BW=1526MiB/s (1600MB/s)(894GiB/600002msec); 0 zone resets
  lat (usec): min=79, max=60377k, avg=654.74, stdev=6081.22
  ...

  Final:
  mdraid5 on top of 4x NVMe drives: max latency - 19160 usec.
  mdraid5 on top of 5x NVMe drives: max latency - 63847k usec. 

  As you can see, the max latency increases significantly, to 63847k usec
  (test 2).

  If I increase the runtime to 3600/7200 sec, I see a hung task in dmesg:
  ...
  [11480.292296] INFO: task fio:2501 blocked for more than 120 seconds.
  [11480.292320]       Not tainted 5.15.0-79-generic #85-Ubuntu
  [11480.292341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [11480.292369] task:fio             state:D stack:    0 pid: 2501 ppid:  2465 flags:0x00004002
  ...
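
  Additional stack traces of the blocked tasks can be captured on demand with SysRq (assuming sysrq is enabled; 'w' dumps tasks in the uninterruptible/blocked state to dmesg):
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger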

  
  To rule out a problem with my NVMe drives, I built an array on RAM drives and got the same behavior.

  modprobe brd rd_nr=6 rd_size=10485760
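
  (rd_nr=6 creates six RAM disks and rd_size is given in KiB, so each /dev/ramN is 10 GiB; they can be listed with, for example:)
  grep ram /proc/partitions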

  mdraid5 on top of 3x RAM drives:
  mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=3 /dev/ram0 /dev/ram1 /dev/ram2
  #---------------
  cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
  md0 : active raid5 ram2[3] ram1[1] ram0[0]
        20953088 blocks super 1.2 level 5, 4k chunk, algorithm 2 [3/3] [UUU]
        bitmap: 1/1 pages [4KB], 65536KB chunk
  #---------------  

  for i in {1..3}; do echo test "$i"; date; fio --name=nvme --numjobs=16 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0 --runtime=600 --time_based=1 --ramp_time=0; done

  fio results:
  Test 1:
  ...
  write: IOPS=497k, BW=1939MiB/s (2034MB/s)(1136GiB/600003msec); 0 zone resets
  lat (usec): min=466, max=6171, avg=1030.71, stdev=39.32
  ...

  Test 2:
  ...
  write: IOPS=497k, BW=1941MiB/s (2035MB/s)(1137GiB/600003msec); 0 zone resets
  lat (usec): min=461, max=6223, avg=1030.06, stdev=39.38
  ...

  Test 3:
  ...
  write: IOPS=497k, BW=1940MiB/s (2034MB/s)(1136GiB/600002msec); 0 zone resets
  lat (usec): min=474, max=6179, avg=1030.68, stdev=39.29
  ...

  Max latency is 6223 usec (test 2).

  mdraid5 on top of 4x RAM drives:
  mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=4 /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3
  #---------------
  cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
  md0 : active raid5 ram3[4] ram2[2] ram1[1] ram0[0]
        31429632 blocks super 1.2 level 5, 4k chunk, algorithm 2 [4/4] [UUUU]
        bitmap: 1/1 pages [4KB], 65536KB chunk
  #---------------

  for i in {1..3}; do echo test "$i"; date; fio --name=nvme --numjobs=16 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0 --runtime=600 --time_based=1 --ramp_time=0; done

  fio results:
  Test 1:
  ...
  write: IOPS=438k, BW=1712MiB/s (1796MB/s)(1003GiB/600002msec); 0 zone resets
  lat (usec): min=468, max=6902, avg=1167.45, stdev=46.17
  ...

  Test 2:
  ...
  write: IOPS=438k, BW=1711MiB/s (1794MB/s)(1002GiB/600004msec); 0 zone resets
  lat (usec): min=470, max=7689, avg=1168.49, stdev=46.14
  ...

  Test 3:
  ...
  write: IOPS=438k, BW=1712MiB/s (1796MB/s)(1003GiB/600003msec); 0 zone resets
  lat (usec): min=479, max=6376, avg=1167.40, stdev=46.18
  ...

  Max latency is 7689 usec (test 2).

  mdraid5 on top of 5x RAM drives:
  mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=5 /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3 /dev/ram4
  #---------------
  cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
  md0 : active raid5 ram4[5] ram3[3] ram2[2] ram1[1] ram0[0]
        41906176 blocks super 1.2 level 5, 4k chunk, algorithm 2 [5/5] [UUUUU]
        bitmap: 0/1 pages [0KB], 65536KB chunk
  #---------------
  for i in {1..3}; do echo test "$i"; date; fio --name=nvme --numjobs=16 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0 --runtime=600 --time_based=1 --ramp_time=0; done

  fio results:
  Test 1:
  ...
  write: IOPS=452k, BW=1764MiB/s (1850MB/s)(1034GiB/600001msec); 0 zone resets
  lat (usec): min=13, max=68868k, avg=1133.11, stdev=79882.97
  ...

  Test 2:
  ...
  write: IOPS=451k, BW=1763MiB/s (1849MB/s)(1033GiB/600001msec); 0 zone resets
  lat (usec): min=11, max=45339k, avg=1134.04, stdev=78829.34
  ...

  Test 3:
  ...
  write: IOPS=453k, BW=1770MiB/s (1856MB/s)(1037GiB/600001msec); 0 zone resets
  lat (usec): min=12, max=63593k, avg=1129.34, stdev=84268.37
  ...

  Max latency is 68868k usec (test 1).

  Final:
  mdraid5 on top of 3x RAM drives: max latency - 6223 usec.
  mdraid5 on top of 4x RAM drives: max latency - 7689 usec.
  mdraid5 on top of 5x RAM drives: max latency - 68868k usec.

  I also reproduced this behavior on mdraid4 and mdraid5 in CentOS 7,
  CentOS 9, and Ubuntu 22.04 with kernels 5.15.0-79 and 6.4 (mainline).

  But I can't reproduce this behavior on mdraid6.
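
  For the mdraid6 comparison I built an analogous RAM-drive array (a sketch of the command, with the same chunk and bitmap settings; the exact device count may differ):
  mdadm --create /dev/md0 --level=6 --chunk=4K --bitmap=internal --raid-devices=5 /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3 /dev/ram4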

  Could you please help me understand why this happens and whether there is any chance to fix it?
  Let me know if you need more detailed information about my environment or if you would like me to run more tests.

  Thank you in advance.

  ProblemType: Bug
  DistroRelease: Ubuntu 20.04
  Package: mdadm 4.1-5ubuntu1.2
  ProcVersionSignature: Ubuntu 5.15.0-79.86~20.04.2-generic 5.15.111
  Uname: Linux 5.15.0-79-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  ApportVersion: 2.20.11-0ubuntu27.27
  Architecture: amd64
  CasperMD5CheckResult: skip
  Date: Tue Aug 15 08:54:39 2023
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  Lsusb-t:
   
  Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
  MDadmExamine.dev.sda:
   /dev/sda:
      MBR Magic : aa55
   Partition[0] :     62914559 sectors at            1 (type ee)
  MDadmExamine.dev.sda1: Error: command ['/sbin/mdadm', '-E', '/dev/sda1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda1.
  MDadmExamine.dev.sda2:
   /dev/sda2:
      MBR Magic : aa55
  MDadmExamine.dev.sda3: Error: command ['/sbin/mdadm', '-E', '/dev/sda3'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda3.
  MDadmExamine.dev.sda4: Error: command ['/sbin/mdadm', '-E', '/dev/sda4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda4.
  MachineType: VMware, Inc. VMware Virtual Platform
  ProcEnviron:
   LANGUAGE=en_US:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.15.0-79-generic root=/dev/mapper/main-root ro quiet
  ProcMounts: Error: [Errno 40] Too many levels of symbolic links: '/proc/mounts'
  SourcePackage: mdadm
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 11/12/2020
  dmi.bios.release: 4.6
  dmi.bios.vendor: Phoenix Technologies LTD
  dmi.bios.version: 6.00
  dmi.board.name: 440BX Desktop Reference Platform
  dmi.board.vendor: Intel Corporation
  dmi.board.version: None
  dmi.chassis.asset.tag: No Asset Tag
  dmi.chassis.type: 1
  dmi.chassis.vendor: No Enclosure
  dmi.chassis.version: N/A
  dmi.ec.firmware.release: 0.0
  dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd11/12/2020:br4.6:efr0.0:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:sku:
  dmi.product.name: VMware Virtual Platform
  dmi.product.version: None
  dmi.sys.vendor: VMware, Inc.
  etc.blkid.tab: Error: [Errno 2] No such file or directory: '/etc/blkid.tab'

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/2031383/+subscriptions



