[Bug 1987663] Re: cinder-volume: "Failed to re-export volume, setting to ERROR" with "tgtadm: failed to send request hdr to tgt daemon, Transport endpoint is not connected" on service startup

Mauricio Faria de Oliveira 1987663 at bugs.launchpad.net
Wed Oct 18 17:54:29 UTC 2023


** Description changed:

- Debian bug #1018084 [1]
- Debian MR on Salsa: [2]
+ [Impact]
  
- [1] https://bugs.debian.org/1018084
- [2] https://salsa.debian.org/openstack-team/services/cinder/-/merge_requests/2
+  * The cinder-volume service might fail to re-export volumes
+    in-use on startup if tgt.service isn't fully started yet.
+    
+  * This affects the 'lvm' driver with 'tgtadm' target helper
+    (which runs 'tgtadm' commands that need the service ready).
+    
+  * Snippets from /var/log/cinder/cinder-volume.log:
+  
+    Failed to re-export volume, setting to ERROR.
+    ...
+    Command: tgtadm --lld iscsi --op show --mode target
+    ...
+    Stderr: 'tgtadm: failed to send request hdr to tgt daemon,
+    Transport endpoint is not connected\n'
  
- Problem:
+  * This issue is more common in openstack compute nodes
+    with networking (ovs/ovn) that takes long to start up,
+    which might delay the startup of tgt.service _after_
+    cinder-volume.service.
  
- The cinder-volume.service unit is _not_ ordered after the tgt.service unit,
- and thus might fail to run 'tgtadm' commands because tgtd is not yet ready,
- on service start up / boot:
+ [Test Steps]
  
-  INFO cinder.service [-] Starting cinder-volume node (version 12.0.9)
-  ...
-  ERROR cinder.volume.manager [req-UUID - - - - -] Failed to re-export volume, setting to ERROR.: ProcessExecutionError: Unexpected error while running command.
-  Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgtadm --lld iscsi --op show --mode target
-  Exit code: 107
-  Stdout: u''
-  Stderr: u'tgtadm: failed to send request hdr to tgt daemon, Transport endpoint is not connected\n'
+  * Steps to reproduce are detailed in comment #2.
+    Summary:
+  
+  * Install mysql, rabbitmq-server, keystone, and cinder
+    (controller and storage nodes; backup node unneeded).
  
- Approach:
+  * Configure cinder-volume (storage node) for LVM backend
+    and tgtadm iSCSI helper (tgt.service).
+  
+  * Create a cinder volume, and configure it as 'in-use'.
  
- It should be OK to order `cinder-volume.service` (generated from `cinder-volume.init.in` template) with `After=tgt.service`,
- as the `cinder-volume` package has a `Depends: tgt` dependency, and actually restarts `tgt` on its install/remove scripts,
- thus it is already strongly bound to `tgt` at the packaging level, even if its runtime usage is optional (eg, other drivers).
+  * Simulate a start delay on tgt.service with a drop-in.
  
- # dpkg -s cinder-volume | grep Depends:
- Depends: ..., tgt, ...
+  * Restart services: cinder-volume.service tgt.service
+ 
+  * Check sequence of service startup.
+  
+  * Check status of the cinder volume: 
+    'in-use' (expected) or 'error' (bug).
+  
+  * Check log file /var/log/cinder/cinder-volume.log for
+    'tgtadm: failed to send request hdr to tgt daemon'.
+ 
+ [Regression Potential]
+ 
+  * The fix introduces systemd unit 'After=' and 'Wants='
+    properties for tgt.service in cinder-volume.service,
+    thus might delay the boot process (multi-user.target).
+ 
+      $ systemctl show cinder-volume.service | grep WantedBy=
+      WantedBy=multi-user.target
+    
+  * However, the boot process already waits on tgt.service
+    anyway, thus the difference (if any) should not be big,
+    and would provide more correct behavior.
+    
+      $ systemctl show tgt.service | grep WantedBy=
+      WantedBy=multi-user.target
+ 
+  * If tgt.service is not present (tgt package not installed)
+    _no errors_ occur, as both 'After=' and 'Wants=' are weak
+    ordering/dependency properties (man 5 systemd.unit).
+    
+ [Other Info]
+ 
+  * The fix uses a systemd service drop-in snippet because
+    the service unit is generated by openstack-pkg-tools
+    (pkgos-gen-systemd-unit) based on the 'init' service,
+    and it only emits 'Wants=' for network-online.target.
+    
+  * Changing that in openstack-pkg-tools would change behavior
+    in stable releases, and would only manifest at build time,
+    for many openstack packages that have no issues now.
+    
+  * We'll continue to pursue the general improvement in
+    Debian, so it comes into Ubuntu development release,
+    but for the Ubuntu stable releases, this should do.
+ 
+ [Original Bug Description]
  
  Real-world example:
  
  1) Unit `ovs-vswitchd.service` took 2 minutes to start up.
  2) That delayed `network.target` (`ovs-vswitchd.service` has `Before=network.target`).
  3) That delayed `tgt.service` too (it has `After=network.target`).
  
  4) BUT that did _not_ delay `cinder-volume.service` (it does _not_ have `After=tgt.service`)
  5) So it failed to talk to tgtd with tgtadm, and volume re-export failed.
  
   $ cat sos_commands/logs/journalctl_--no-pager_--catalog_--boot \
     | grep -E -B1 'Subject: Unit (cinder-volume.service|tgt.service|ovs-vswitchd.service|network.target)'
  
   Aug 10 06:23:15 <HOST> systemd[1]: Started OpenStack Cinder Volume.
   -- Subject: Unit cinder-volume.service has finished start-up
   --
   <<< ERROR in cinder-volume.log >>>
  
   Aug 10 06:23:16 <HOST> systemd[1]: Starting Open vSwitch Forwarding Unit...
   -- Subject: Unit ovs-vswitchd.service has begun start-up
   --
   <<< DELAY of 2 minutes >>>
  
   Aug 10 06:25:17 <HOST> systemd[1]: Started Open vSwitch Forwarding Unit.
   -- Subject: Unit ovs-vswitchd.service has finished start-up
   --
   Aug 10 06:25:17 <HOST> systemd[1]: Reached target Network.
   -- Subject: Unit network.target has finished start-up
   --
   Aug 10 06:25:17 <HOST> systemd[1]: Starting (i)SCSI target daemon...
   -- Subject: Unit tgt.service has begun start-up
   --
   Aug 10 06:25:19 <HOST> systemd[1]: Started (i)SCSI target daemon.
   -- Subject: Unit tgt.service has finished start-up
  
   <<< NOW tgtd is running >>>
  
  @ var/log/cinder/cinder-volume.log
  
    9901 2022-08-10 06:23:21.515 3939 INFO cinder.service [-] Starting cinder-volume node (version 12.0.9)
   ...
    9932 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager [req-e67a852a-6cce-4073-a2de-d3487c85d586 - - - - -] Failed to re-export volume, setting to ERROR.: ProcessExecutionError: Unexpected error while running command.
    9933 Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgtadm --lld iscsi --op show --mode target
    9934 Exit code: 107
    9935 Stdout: u''
    9936 Stderr: u'tgtadm: failed to send request hdr to tgt daemon, Transport endpoint is not connected\n'
    9937 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager Traceback (most recent call last):
    9938 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 486, in init_host
    9939 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     self.driver.ensure_export(ctxt, volume)
    9940 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/cinder/volume/drivers/lvm.py", line 826, in ensure_export
    9941 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     self.target_driver.ensure_export(context, volume, volume_path)
    9942 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/cinder/volume/targets/iscsi.py", line 261, in ensure_export
    9943 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     old_name=None, **portals_config)
    9944 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/cinder/utils.py", line 818, in _wrapper
    9945 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     return r.call(f, *args, **kwargs)
    9946 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/retrying.py", line 206, in call
    9947 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     return attempt.get(self._wrap_exception)
    9948 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/retrying.py", line 247, in get
    9949 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     six.reraise(self.value[0], self.value[1], self.value[2])
    9950 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/retrying.py", line 200, in call
    9951 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    9952 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/cinder/volume/targets/tgt.py", line 141, in create_iscsi_target
    9953 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     run_as_root=True)
    9954 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/cinder/utils.py", line 126, in execute
    9955 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     return processutils.execute(*cmd, **kwargs)
    9956 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager   File "/usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py", line 424, in execute
    9957 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager     cmd=sanitized_cmd)
    9958 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager ProcessExecutionError: Unexpected error while running command.
    9959 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgtadm --lld iscsi --op show --mode target
    9960 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager Exit code: 107
    9961 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager Stdout: u''
    9962 2022-08-10 06:23:23.398 3932 ERROR cinder.volume.manager Stderr: u'tgtadm: failed to send request hdr to tgt daemon, Transport endpoint is not connected\n'

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to cinder in Ubuntu.
https://bugs.launchpad.net/bugs/1987663

Title:
  cinder-volume: "Failed to re-export volume, setting to ERROR" with
  "tgtadm: failed to send request hdr to tgt daemon, Transport endpoint
  is not connected" on service startup

Status in cinder package in Ubuntu:
  In Progress
Status in cinder package in Debian:
  Fix Released

Bug description:
  [Impact]

   * The cinder-volume service might fail to re-export volumes
     in-use on startup if tgt.service isn't fully started yet.

   * This affects the 'lvm' driver with 'tgtadm' target helper
     (which runs 'tgtadm' commands that need the service ready).

   * Snippets from /var/log/cinder/cinder-volume.log:

     Failed to re-export volume, setting to ERROR.
     ...
     Command: tgtadm --lld iscsi --op show --mode target
     ...
     Stderr: 'tgtadm: failed to send request hdr to tgt daemon,
     Transport endpoint is not connected\n'

   * This issue is more common in openstack compute nodes
     with networking (ovs/ovn) that takes long to start up,
     which might delay the startup of tgt.service _after_
     cinder-volume.service.

  [Test Steps]

   * Steps to reproduce are detailed in comment #3.
     Summary:

   * Install mysql, rabbitmq-server, keystone, and cinder
     (controller and storage nodes; backup node unneeded).

   * Configure cinder-volume (storage node) for LVM backend
     and tgtadm iSCSI helper (tgt.service).

   * Create a cinder volume, and configure it as 'in-use'.

   * Simulate a start delay on tgt.service with a drop-in.

   * Restart services: cinder-volume.service tgt.service

   * Check sequence of service startup.

   * Check status of the cinder volume:
     'in-use' (expected) or 'error' (bug).

   * Check log file /var/log/cinder/cinder-volume.log for
     'tgtadm: failed to send request hdr to tgt daemon'.
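
  The "simulate a start delay" step above could be sketched as follows.
  This is only an illustration, not the exact reproducer from the bug
  comments: the drop-in file name (delay.conf) and the 30-second sleep
  are assumptions, and the script writes into a scratch directory unless
  SYSTEMD_DIR is set (a real test host would use /etc/systemd/system).
  The sleep is kept below systemd's default 90-second start timeout.

```shell
#!/bin/sh
# Sketch: delay the start of tgt.service with a systemd drop-in.
# ASSUMPTION: 'delay.conf' and the 30-second sleep are illustrative values.
# SYSTEMD_DIR defaults to a scratch directory so the sketch is safe to run;
# on a real test host it would be /etc/systemd/system.
SYSTEMD_DIR="${SYSTEMD_DIR:-$(mktemp -d)}"
DELAY_DROPIN="$SYSTEMD_DIR/tgt.service.d/delay.conf"

mkdir -p "$(dirname "$DELAY_DROPIN")"

# ExecStartPre= runs before the main daemon, so tgtd starts ~30s late.
printf '%s\n' '[Service]' 'ExecStartPre=/bin/sleep 30' > "$DELAY_DROPIN"
cat "$DELAY_DROPIN"

# On a real host, the remaining test steps would then be:
#   systemctl daemon-reload
#   systemctl restart cinder-volume.service tgt.service
#   grep 'tgtadm: failed to send request hdr to tgt daemon' \
#        /var/log/cinder/cinder-volume.log
```

  Without the fix, cinder-volume.service is expected to start while
  tgt.service is still sleeping, which should reproduce the error.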

  [Regression Potential]

   * The fix introduces systemd unit 'After=' and 'Wants='
     properties for tgt.service in cinder-volume.service,
     thus might delay the boot process (multi-user.target).

       $ systemctl show cinder-volume.service | grep WantedBy=
       WantedBy=multi-user.target

   * However, the boot process already waits on tgt.service
     anyway, thus the difference (if any) should not be big,
     and would provide more correct behavior.

       $ systemctl show tgt.service | grep WantedBy=
       WantedBy=multi-user.target

   * If tgt.service is not present (tgt package not installed)
     _no errors_ occur, as both 'After=' and 'Wants=' are weak
     ordering/dependency properties (man 5 systemd.unit).
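
  For illustration, a drop-in with the two properties named above might
  look like the sketch below. The file name (tgt.conf) and its location
  are assumptions, not necessarily what the package ships; the script
  writes into a scratch directory unless SYSTEMD_DIR is set.

```shell
#!/bin/sh
# Sketch: order cinder-volume.service after tgt.service via a drop-in.
# ASSUMPTION: the drop-in name 'tgt.conf' is hypothetical.
# SYSTEMD_DIR defaults to a scratch directory so the sketch is safe to run;
# on a real host it would be /etc/systemd/system (or the package's
# /lib/systemd/system for a shipped drop-in).
SYSTEMD_DIR="${SYSTEMD_DIR:-$(mktemp -d)}"
FIX_DROPIN="$SYSTEMD_DIR/cinder-volume.service.d/tgt.conf"

mkdir -p "$(dirname "$FIX_DROPIN")"

# After= only orders the units; Wants= pulls tgt.service in without
# failing cinder-volume.service if tgt.service is missing or fails.
printf '%s\n' '[Unit]' 'After=tgt.service' 'Wants=tgt.service' > "$FIX_DROPIN"
cat "$FIX_DROPIN"

# On a real host, reload and inspect the resulting properties:
#   systemctl daemon-reload
#   systemctl show cinder-volume.service | grep -E '^(After|Wants)='
```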

  [Other Info]

   * The fix uses a systemd service drop-in snippet because
     the service unit is generated by openstack-pkg-tools
     (pkgos-gen-systemd-unit) based on the 'init' service,
     and it only emits 'Wants=' for network-online.target.

   * Changing that in openstack-pkg-tools would change behavior
     in stable releases, and would only manifest at build time,
     for many openstack packages that have no issues now.

   * We'll continue to pursue the general improvement in
     Debian, so it comes into Ubuntu development release,
     but for the Ubuntu stable releases, this should do.

  [Original Bug Description]

  Real-world example in comment #2.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cinder/+bug/1987663/+subscriptions



