[Bug 1888675] Re: [sru] fail to extend in-use fibre channel volume due to multipath-tools version

nikhil kshirsagar 1888675 at bugs.launchpad.net
Wed Sep 20 11:14:41 UTC 2023


I have tested this fix using gdb breakpoints to simulate the "timeout"
from multipathd resize map.

I've verified the fix code is called correctly in this situation using
the proposed package, and that the resize command is resent.

After multipathd resize map stops timing out, we see the new size
reflected in both the underlying device and the mpath device.

I am marking the verification flags accordingly, and also attaching the
testing details file to the bug for reference.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1888675

Title:
  [sru] fail to extend in-use fibre channel volume due to multipath-tools version

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Triaged
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in os-brick:
  Fix Released
Status in python-os-brick package in Ubuntu:
  Fix Released
Status in python-os-brick source package in Jammy:
  Fix Committed

Bug description:
  [IMPACT]

  `multipathd reconfigure` has been an asynchronous command since version 0.6.1 of multipath-tools. The difference can be seen here:
  https://github.com/openSUSE/multipath-tools/blob/0.6.0/multipathd/main.c#L997
  https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/main.c#L1135

  This leads to a failure to extend an in-use fibre channel volume:
  when `multipathd resize map` is executed immediately after
  `multipathd reconfigure`, it outputs 'timeout' because `multipathd
  reconfigure` has not yet finished.

  However, the current code only considers the 'fail' result, so
  timeouts are not retried but instead end up as failures, and the FC
  volume is not extended.

  [TEST PLAN]

  1. Ensure that enough fibre channel volumes are attached to the
  compute node that `multipathd reconfigure` takes a long time to
  complete.

  2. Create a server named 'c1' on the compute node.

  3. Attach a 4G volume named 'v1' to the server 'c1'.

  $ openstack server add volume c1 v1

  4. Extend the volume 'v1' to 8G.

  $ cinder --os-volume-api-version 3.42 extend v1 8

  Check the size using 'fdisk -l' and verify from the logs (see
  [OTHER INFO]).

  Without the fix, after the volume has been extended from 4G to 8G,
  the volume in the instance is still 4G, although the fibre channel
  volume scsi_wwn has been changed to 8G.

  With the fix, the new size is reflected once the resize succeeds: if
  `multipathd resize map` returns a timeout, we keep retrying the same
  command for up to 120 more seconds, giving the (now asynchronous)
  `multipathd reconfigure` a chance to complete and letting
  `multipathd resize map` run successfully on a retry.
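
  The retry behaviour described above can be sketched roughly as
  follows. This is only an illustration, not the os-brick patch itself:
  the names resize_map_with_retry, run_resize and ResizeFailed are
  hypothetical, run_resize stands in for whatever runs `multipathd
  resize map` and returns its output, and the 'ok' string used in the
  usage note is an assumed success output.

```python
import time


class ResizeFailed(Exception):
    """Raised when the resize does not succeed within the deadline."""


def resize_map_with_retry(run_resize, deadline=120.0, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Retry a resize command for as long as it reports 'timeout'.

    run_resize is a hypothetical callable that runs the resize command
    and returns its textual output.
    """
    start = clock()
    while True:
        out = run_resize()
        if "fail" in out:
            # A hard failure is not retried.
            raise ResizeFailed(out)
        if "timeout" not in out:
            # Neither 'fail' nor 'timeout': treat as success.
            return out
        if clock() - start >= deadline:
            raise ResizeFailed("still timing out after %s seconds" % deadline)
        sleep(interval)
```

  For example, a fake run_resize that returns 'timeout' twice and then
  'ok' is called three times before the function returns, while one
  that always returns 'timeout' raises ResizeFailed once the deadline
  has passed.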

  
  [WHERE PROBLEMS COULD OCCUR]

  I have verified the code is robust and I do not anticipate any issues.
  The patch is already merged to master and, at the time of writing,
  has received 2 acks for the merge into yoga
  (https://review.opendev.org/c/openstack/os-brick/+/888343).
  "multipathd resize map" will not return anything but 1 or 0 (see
  https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/cli_handlers.c#L702C1-L719C2),
  and if it returns 1, the ProcessExecutionError exception will indeed
  be raised, because this exception is raised for any return value from
  the executed command other than the default of [0]
  (https://docs.openstack.org/oslo.concurrency/latest/reference/processutils.html).
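
  The exit-code rule described above can be illustrated with a small
  stdlib-only stand-in. This is not oslo.concurrency: the execute
  function, the CommandFailed exception and the exits_with helper below
  are hypothetical and only mimic the idea of raising for any exit
  code outside an allowed list that defaults to 0.

```python
import subprocess
import sys


class CommandFailed(Exception):
    """Hypothetical stand-in for ProcessExecutionError."""

    def __init__(self, cmd, exit_code):
        super().__init__("%r exited with %d" % (cmd, exit_code))
        self.exit_code = exit_code


def execute(cmd, check_exit_code=(0,)):
    """Run cmd and raise unless its exit code is in check_exit_code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode not in check_exit_code:
        raise CommandFailed(cmd, proc.returncode)
    return proc.stdout, proc.stderr


def exits_with(code):
    """Build a command that exits with the given code (for demonstration)."""
    return [sys.executable, "-c", "import sys; sys.exit(%d)" % code]
```

  With the default, execute(exits_with(1)) raises CommandFailed, which
  is the case that matters here: a return value of 1 from the command
  always surfaces as an exception that the caller can handle.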

  However, if the timeout is for genuine reasons and the multipath
  timeout is set to a smaller value, say 30 seconds, we would
  needlessly wait 120 seconds instead of failing the operation at 30
  seconds. Also, we could run into this same issue if the resize map
  operation takes even longer than 120 seconds, but that is unlikely,
  and I anticipate the multipathd timeout will also be set to a maximum
  of 120 seconds.

  [OTHER INFO]

  Logs WITHOUT the fix show
  ==============
  2020-07-23 12:42:46.764 2713929 INFO nova.compute.manager [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0e0ec97 4f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] [instance: ddd3010f-fdf9-4e50-a363-edd02532e683] Cinder d-c206-4713-8381-1ee47d412f31; extending it to detect new size
  2020-07-23 12:42:48.254 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] Find Multipath device file for volume WWN 3600502196
  2020-07-23 12:42:48.355 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] mpath(/dev/disk/by-id/dm-uuid-mpath-360050767088current size 4294967296
  2020-07-23 12:42:48.449 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] mpath(/dev/disk/by-id/dm-uuid-mpath-360050767088new size 4294967296

  The logs indicate that the current (i.e. older) size (4294967296) is
  the same as the new size (4294967296).
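
  As a quick sanity check of those numbers, the arithmetic can be
  written out as below. This assumes the 4G and 8G in the test plan
  mean GiB (2**30 bytes); the helper name size_changed is hypothetical.

```python
GIB = 2 ** 30  # bytes per GiB (assumed unit for the 4G/8G volume sizes)


def size_changed(current_bytes, new_bytes):
    """Return True if the reported size grew after the extend."""
    return new_bytes > current_bytes


logged_current = 4294967296  # "current size" from the log above
logged_new = 4294967296      # "new size" from the log above

print(logged_current / GIB)                      # 4.0
print(8 * GIB)                                   # 8589934592
print(size_changed(logged_current, logged_new))  # False
```

  Under that assumption, a successful extend to 8G would be expected to
  report a new size of 8589934592 rather than 4294967296.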

  Note that the fibre channel volume scsi_wwn has been changed to the
  new size.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1888675/+subscriptions



