[Bug 1888675] Re: [sru] fail to extend in-use fibre channel volume due to multipath-tools version
nikhil kshirsagar
1888675 at bugs.launchpad.net
Wed Sep 20 11:14:41 UTC 2023
I have tested this fix using gdb breakpoints to simulate the "timeout"
from `multipathd resize map`.
I've verified the fix code is called correctly in this situation using
the proposed package, and that the resize command is resent.
Once `multipathd resize map` stops timing out, the new size is
reflected in both the underlying device and the mpath device.
I am marking the verification flags accordingly, and also attaching the
testing details file to the bug for reference.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1888675
Title:
[sru] fail to extend in-use fibre channel volume due to multipath-
tools version
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
Triaged
Status in Ubuntu Cloud Archive zed series:
Fix Released
Status in os-brick:
Fix Released
Status in python-os-brick package in Ubuntu:
Fix Released
Status in python-os-brick source package in Jammy:
Fix Committed
Bug description:
[IMPACT]
`multipathd reconfigure` has been an asynchronous command since version 0.6.1 of multipath-tools. The relevant code differs between the two versions as follows:
https://github.com/openSUSE/multipath-tools/blob/0.6.0/multipathd/main.c#L997
https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/main.c#L1135
This leads to a failure to extend an in-use fibre channel volume: when
`multipathd resize map` is executed immediately after `multipathd
reconfigure`, it outputs 'timeout' because the `multipathd
reconfigure` command has not yet finished.
However, the current code only considers the 'fail' result, so
timeouts are not retried and instead end up as failures, with the
result that the FC volume is not extended.
[TEST PLAN]
1. Ensure that enough fibre channel volumes are attached to the
compute node that `multipathd reconfigure` takes a long time to
complete.
2. Create a server named 'c1' on the compute node.
3. Attach a 4G volume named 'v1' to the server 'c1'.
$ openstack server add volume c1 v1
4. Extend the volume 'v1' to 8G.
$ cinder --os-volume-api-version 3.42 extend v1 8
Check the size using 'fdisk -l' and verify from the logs (see
[OTHER INFO]).
Without the fix, after the volume has been extended from 4G to 8G,
the volume in the instance is still 4G, even though the fibre channel
volume (scsi_wwn) itself has been changed to 8G.
With the fix, the new size is reflected immediately: if `multipathd
resize map` returns a timeout, we keep retrying the same command for
up to 120 seconds, giving the (now asynchronous) `multipathd
reconfigure` a chance to complete so that `multipathd resize map`
runs successfully on a retry.
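The retry behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the os-brick patch itself: `resize_with_retry`, `run_resize` and the injected `clock`/`sleep` parameters are hypothetical names, and the command is simulated by a callable that returns the text multipathd would print.

```python
import time
from typing import Callable

RESIZE_RETRY_WINDOW = 120.0  # seconds; the retry window quoted in the text


def resize_with_retry(
    run_resize: Callable[[], str],
    window: float = RESIZE_RETRY_WINDOW,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call run_resize() until its output no longer contains 'timeout'.

    Raises RuntimeError on a 'fail' result, and TimeoutError if the
    command is still timing out once `window` seconds have elapsed.
    """
    start = clock()
    while True:
        output = run_resize()
        if "timeout" not in output:
            if "fail" in output:
                raise RuntimeError(f"resize failed: {output!r}")
            return output
        if clock() - start >= window:
            raise TimeoutError(f"resize still timing out after {window} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Simulated command (assumption): times out twice while reconfigure
    # is still running, then succeeds.
    replies = iter(["timeout", "timeout", "ok"])
    print(resize_with_retry(lambda: next(replies), sleep=lambda _s: None))
```

Passing the clock and sleep functions in as parameters is just a convenience so the loop can be exercised without actually waiting 120 seconds.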
[WHERE PROBLEMS COULD OCCUR]
I have verified the code is robust and I do not anticipate any issues.
The patch is already merged to master and, at the time of writing,
has received 2 acks for the merge into yoga
(https://review.opendev.org/c/openstack/os-brick/+/888343).
`multipathd resize map` will not return anything but 1 or 0 (see
https://github.com/openSUSE/multipath-tools/blob/0.6.1/multipathd/cli_handlers.c#L702C1-L719C2).
If it returns 1, the ProcessExecutionError exception will indeed be
raised, because this exception is raised for any return value from
the executed command apart from the default of [0]
(https://docs.openstack.org/oslo.concurrency/latest/reference/processutils.html).
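To illustrate the exit-code behaviour relied on above, here is a standard-library analogue. It is a sketch only: it uses `subprocess` rather than oslo.concurrency, `run_checked` and `ok_codes` are hypothetical names, and a small Python one-liner stands in for a command that prints 'timeout' and exits with status 1.

```python
import subprocess
import sys


def run_checked(args, ok_codes=(0,)):
    """Run a command and return its stdout.

    Raises subprocess.CalledProcessError if the exit code is not in
    ok_codes, keeping the captured stdout/stderr on the exception.
    """
    proc = subprocess.run(args, capture_output=True, text=True)
    if proc.returncode not in ok_codes:
        raise subprocess.CalledProcessError(
            proc.returncode, args, proc.stdout, proc.stderr
        )
    return proc.stdout


if __name__ == "__main__":
    # Stand-in (assumption) for a command that prints 'timeout' and exits 1.
    fake_cmd = [sys.executable, "-c", "import sys; print('timeout'); sys.exit(1)"]
    try:
        run_checked(fake_cmd)
    except subprocess.CalledProcessError as err:
        # The caller can inspect the captured output to decide whether to retry.
        print(err.returncode, err.stdout.strip())
```

Because the output is preserved on the exception, the caller can tell a 'timeout' apart from a 'fail' even though both arrive as a non-zero exit.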
However, if the timeout is genuine and the multipathd timeout is set
to a smaller value, say 30 seconds, we would needlessly wait 120
seconds instead of failing the operation at 30 seconds.
We could also run into this same issue if the resize map operation
takes even longer than 120 seconds, but that is unlikely, and I
anticipate the multipathd timeout will also be set to a maximum of
120 seconds.
[OTHER INFO]
Logs WITHOUT the fix show
==============
2020-07-23 12:42:46.764 2713929 INFO nova.compute.manager [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0e0ec97 4f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] [instance: ddd3010f-fdf9-4e50-a363-edd02532e683] Cinder d-c206-4713-8381-1ee47d412f31; extending it to detect new size
2020-07-23 12:42:46.764 2713929 INFO nova.compute.manager [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0e0ec97 4f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] [instance: ddd3010f-fdf9-4e50-a363-edd02532e683] Cinder d-c206-4713-8381-1ee47d412f31; extending it to detect new size
2020-07-23 12:42:48.254 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] Find Multipath device file for volume WWN 3600502196
2020-07-23 12:42:48.355 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] mpath(/dev/disk/by-id/dm-uuid-mpath-360050767088current size 4294967296
2020-07-23 12:42:48.449 2713929 INFO os_brick.initiator.linuxscsi [req-8defc1e3-c514-4673-a3b7-98b5343ba1cd 46ff538c684b4816b9454bfdc0825c54e0f20deff2 - 15396630649143a78afa714b3e4a0adb 15396630649143a78afa714b3e4a0adb] mpath(/dev/disk/by-id/dm-uuid-mpath-360050767088new size 4294967296
The logs indicate that the current (i.e. older) size (4294967296) is
the same as the new size (4294967296).
Note that the fibre channel volume (scsi_wwn) itself has been changed
to the new size.
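As a quick check of the numbers in these logs, the byte counts can be converted as below. This assumes the 4G and 8G sizes in the test plan are GiB (1024^3 bytes); `bytes_to_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(size_bytes: int) -> float:
    """Convert a size in bytes to GiB."""
    return size_bytes / GIB


logged_size = 4294967296  # both "current size" and "new size" in the logs
expected_new_size = 8 * GIB  # what "new size" should read after extending to 8G

print(bytes_to_gib(logged_size))  # 4.0
print(expected_new_size)  # 8589934592
```

So the "new size" of 4294967296 is still the old 4G, whereas a successful resize would report 8589934592.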
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1888675/+subscriptions