[Bug 1815599] Re: multipath shows '#:#:#:#' for iscsi device after error injection
Christian Ehrhardt
1815599 at bugs.launchpad.net
Fri Mar 22 06:58:41 UTC 2019
@JFH - As just discussed, you mentioned that you have gone over the
existing tunables with Heinz Werner. Thanks for linking them here again
to make sure that Shixm knows about them.
This somewhat reminds me of bug 1540407 - but those changes have been in Ubuntu since 16.04.
The same goes for the even older bug 1374999.
Your kernel and open-iscsi versions indicate that you are on Bionic - is that correct?
Multipath-tools should be on
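To confirm which multipath-tools version you actually have, the same kind of dpkg query you used for open-iscsi works, for example:
dpkg -l | grep multipath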
Your repro of:
> Run storage side error inject 'node reset' for SVC
isn't clear to me; I have neither a storage server on which I'm allowed to do error injection nor the tools/UI to control one.
Instead I have tried the repro steps that are available to me, as in:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/comments/7
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/comments/8
But all of them worked, see below.
Note that I never reached the loss of the path info to '#:#:#:#' - even in the faulty state it kept the path info.
@Shixm - since you have a setup that can reproduce this, can you check
whether any of the later releases (Cosmic/Disco) already resolves the issue that
you are seeing? Then we could try to hunt down which changes might have
resolved it for you instead of assuming this would need a totally new
change.
@Shixm - Is there any way to reproduce this without the 'node reset' for SVC?
Finally, this might well need subject matter expertise - can we make sure that IBM's zfcp experts (the devs, and maybe Thorsten who drove the old bugs) are subscribed to the mirrored bug 175431?
@JFH - do you think you can check that with the IBM team?
------------
Test results when retrying to trigger the issue:
Approach #1 gives me this (which isn't exactly the same state):
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
|- 1:0:0:1073954852 sdg 8:96 active ready running
|- 1:0:1:1073954852 sdn 8:208 active ready running
|- 0:0:0:1073954852 sdb 8:16 active faulty offline
`- 0:0:1:1073954852 sdj 8:144 active ready running
If I add it back after this it works just fine again:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
|- 1:0:0:1073954852 sdg 8:96 active ready running
|- 1:0:1:1073954852 sdn 8:208 active ready running
|- 0:0:1:1073954852 sdj 8:144 active ready running
`- 0:0:0:1073954852 sdb 8:16 active ready running
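For reference, approach #1 essentially takes one SCSI path offline via sysfs and later brings it back - roughly like this (just a sketch; sdb happens to be the path I used in this run):
echo offline > /sys/block/sdb/device/state   # path drops to 'faulty offline' in multipath -ll
echo running > /sys/block/sdb/device/state   # path comes back and recovers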
For the second approach (disable, sleep, enable of the adapter), first check the zfcp config:
lszdev -t zfcp-host
DEVICE TYPE zfcp
Description : SCSI-over-Fibre Channel (FCP) devices and SCSI devices
Modules : zfcp
Active : yes
Persistent : yes
ATTRIBUTE ACTIVE PERSISTENT
allow_lun_scan "1" "1"
datarouter "1" -
dbflevel "3" -
dbfsize "4" -
dif "0" -
no_auto_port_rescan "0" -
port_scan_backoff "500" -
port_scan_ratelimit "60000" -
queue_depth "32" -
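The disable/sleep/enable cycle itself is roughly the following (a sketch only; 0.0.e000 stands in for the actual FCP device bus-ID of the adapter):
chccwdev -d 0.0.e000   # set the FCP device offline
sleep 60
chccwdev -e 0.0.e000   # set it online again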
Initially I get this (as expected):
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
|- 1:0:0:1073954852 sdg 8:96 active ready running
|- 1:0:1:1073954852 sdn 8:208 active ready running
|- 0:0:1:1073954852 sdj 8:144 active i/o pending running
`- 0:0:0:1073954852 sdb 8:16 active i/o pending running
Then after a while it reaches the final fault state:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
|- 1:0:0:1073954852 sdg 8:96 active ready running
|- 1:0:1:1073954852 sdn 8:208 active ready running
|- 0:0:1:1073954852 sdj 8:144 failed faulty running
`- 0:0:0:1073954852 sdb 8:16 failed faulty running
After getting the paths back it immediately switches to:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=37 status=enabled
|- 1:0:0:1073954852 sdg 8:96 active ready running
|- 1:0:1:1073954852 sdn 8:208 active ready running
|- 0:0:1:1073954852 sdj 8:144 failed ready running
`- 0:0:0:1073954852 sdb 8:16 failed ready running
And after less than 20 seconds fully recovers to:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
|- 1:0:0:1073954852 sdg 8:96 active ready running
|- 1:0:1:1073954852 sdn 8:208 active ready running
|- 0:0:1:1073954852 sdj 8:144 active ready running
`- 0:0:0:1073954852 sdb 8:16 active ready running
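For completeness, the transitions above can be watched with nothing more exotic than, for example:
watch -n 5 'multipath -ll'
# or ask the daemon directly
multipathd show paths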
** Changed in: multipath-tools (Ubuntu)
Status: New => Incomplete
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to multipath-tools in Ubuntu.
https://bugs.launchpad.net/bugs/1815599
Title:
multipath shows '#:#:#:#' for iscsi device after error injection
Status in Ubuntu on IBM z Systems:
New
Status in multipath-tools package in Ubuntu:
Incomplete
Bug description:
Problem Description:
After error injection (resetting one node of the storage), 1 of 4 LUNs shows '#:#:#:#' for half of its paths
---uname output---
root@ilzlnx4:~# uname -a
Linux ilzlnx4 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:43:05 UTC 2018 s390x s390x s390x GNU/Linux
Machine Type = s390x
--iscsi initiator
root@ilzlnx4:~# dpkg -l | grep iscsi
ii open-iscsi 2.0.874-5ubuntu2.6 s390x iSCSI initiator tools
---Debugger---
A debugger is not configured
---Steps to Reproduce---
1 Map 4 LUNs via open-iscsi from the SVC
2 Run IO on these LUNs
3 Run storage side error injection 'node reset' on the SVC (started at about 2019/02/11 05:14)
4 Half of one LUN's paths show '#:#:#:#' and never recover without manual intervention
[2019/02/11 05:53:13] INFO send: multipath -ll | cat
[2019/02/11 05:53:29] INFO
3600507638085814a980000000000000a dm-3 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 4:0:0:3 sdr 65:16 active ready running
| `- 6:0:0:3 sdu 65:64 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 1:0:0:3 sdh 8:112 active ready running
`- 2:0:0:3 sdl 8:176 active ready running
3600507638085814a9800000000000009 dm-4 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdf 8:80 active ready running
| `- 2:0:0:1 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
|- #:#:#:# sdo 8:224 active faulty running
`- #:#:#:# sds 65:32 active faulty running
3600507638085814a9800000000000008 dm-2 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 4:0:0:2 sdq 65:0 active ready running
| `- 6:0:0:2 sdt 65:48 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 2:0:0:2 sdk 8:160 active ready running
`- 1:0:0:2 sdg 8:96 active ready running
3600507638085814a9800000000000006 dm-5 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 4:0:0:0 sdm 8:192 active ready running
| `- 6:0:0:0 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 1:0:0:0 sde 8:64 active ready running
`- 2:0:0:0 sdi 8:128 active ready running
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1815599/+subscriptions