[Bug 2004551] Re: upgrade to lunar fails due to rescue-ssh.target or port 22 takeover

Thu Aug 3 21:45:56 UTC 2023

I spent some time debugging this issue today.  One of the MAAS nodes the
Server team uses (horsea) is affected by it (what Christian reported
above was actually part of a joint debugging effort between him and I
earlier today), so I took the opportunity to redeploy the system and try
to understand what was really going on.

I already had some good indication that the problem manifested when apt-
daily-upgrade.service was executed, so that's the first thing I did
after the machine booted, and voilà, the problem happened again.

I could not reproduce the issue just by reinstalling openssh-server, so
my next best guess was that something else was interfering with the
service restart.  I tried cloud-init, to no avail.  Then, I decided to
try to forcefully upgrade systemd, and that caused the issue to manifest
again.  Good, now I had a small reproducer that did not involve
redeploying the machine.

With that in mind, I decided to check what systemd (or its maintscripts)
were doing that could cause such problem.  I remembered that one of the
first things that caught my attention when I was debugging the issue
earlier with Christian was the fact that there were *a lot* of systemd-
related warnings on journalctl like the following:

Aug 03 21:05:35 node-horsea systemd[1]: cron.service: Found left-over process 970 (cron) in control group while starting unit. Ignoring.
Aug 03 21:05:35 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Had these been just related to ssh, I'd have thought that there was
something wrong with openssh-server's maintscripts, maybe.  But as I
said, these warnings were related to *many* services.  They're basically
saying that systemd somehow "lost" control of the process it was
managing, which is concerning.  It also seemed to explain why sshd is
complaining about not being able to bind to port 22 because it's in use.
I could verify that there is indeed a "left-over" sshd process still
bound to port 22.

After some more thinking & log reading, I grew very suspicious that the
command "systemctl daemon-reexec" was the culprit here.  This command is
performed by systemd's postinst script.  And finally, after a reboot to
make sure that the system was back to a sane state, I was able to
confirm the suspicion:

ubuntu at node-horsea:~$ sudo systemctl daemon-reexec                                                                                                                                             
ubuntu at node-horsea:~$ sudo systemctl status ssh                                                                                                                                                
○ ssh.service - OpenBSD Secure Shell server                                                                                                                                                    
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)                                                                                                         
     Active: inactive (dead)                                                                                                                                                                   
       Docs: man:sshd(8)                                                                                                                                                                       
             man:sshd_config(5)                                                                                                                                                                

Aug 03 21:26:21 node-horsea systemd[1]: Starting OpenBSD Secure Shell server...                                                                                                                
Aug 03 21:26:21 node-horsea sshd[913]: Server listening on 0.0.0.0 port 22.                                                                                                                    
Aug 03 21:26:21 node-horsea sshd[913]: Server listening on :: port 22.                                                                                                                         
Aug 03 21:26:21 node-horsea systemd[1]: Started OpenBSD Secure Shell server.
Aug 03 21:29:03 node-horsea sshd[1189]: Accepted publickey for ubuntu from 10.172.68.194 port 36222 ssh2: RSA SHA256:7JLFXGJ/lkuhcSu8/OF8mHjZ8wMD1zEfNSuNGM5u2Lk
Aug 03 21:29:03 node-horsea sshd[1189]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)

You can see that the service is listed as inactive (dead) after the
daemon-reexec.  Now let's try to restart it:

ubuntu at node-horsea:~$ sudo systemctl restart ssh                                                
Job for ssh.service failed because the control process exited with error code.
See "systemctl status ssh.service" and "journalctl -xeu ssh.service" for details.
ubuntu at node-horsea:~$ sudo systemctl status ssh                                                 
× ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-08-03 21:31:25 UTC; 11s ago
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 1914 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
    Process: 1943 ExecStart=/usr/sbin/sshd -D $SSHD_OPTS (code=exited, status=255/EXCEPTION)
   Main PID: 1943 (code=exited, status=255/EXCEPTION)
      Tasks: 1 (limit: 38215)
     Memory: 7.6M
        CPU: 27ms
     CGroup: /system.slice/ssh.service
             └─913 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

Aug 03 21:31:25 node-horsea systemd[1]: Starting OpenBSD Secure Shell server...
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Found left-over process 913 (sshd) in control group while starting unit. Ignoring.
Aug 03 21:31:25 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Aug 03 21:31:25 node-horsea sshd[1943]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Aug 03 21:31:25 node-horsea sshd[1943]: error: Bind to port 22 on :: failed: Address already in use.
Aug 03 21:31:25 node-horsea sshd[1943]: fatal: Cannot bind any address.
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Failed with result 'exit-code'.
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Unit process 913 (sshd) remains running after unit stopped.
Aug 03 21:31:25 node-horsea systemd[1]: Failed to start OpenBSD Secure Shell server.

And there you go: the problem now happens deterministically.

With that in mind, I also found bug #2013543 which is exactly about this
issue.  Upstream has fixed the problem, but work still needs to be done
to SRU the patch into Jammy/Lunar.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to openssh in Ubuntu.
https://bugs.launchpad.net/bugs/2004551

Title:
  upgrade to lunar fails due to rescue-ssh.target or port 22 takeover

Status in openssh package in Ubuntu:
  Confirmed
Status in systemd package in Ubuntu:
  New

Bug description:
  Hi,
  I just upgraded a system from Jammy to Lunar and openssh-server refuses to upgrade well.

  Setting up openssh-server (1:9.0p1-1ubuntu8) ...
  Replacing config file /etc/ssh/sshd_config with new version
  Replacing config file /etc/ssh/sshd_config with new version
  Synchronizing state of ssh.service with SysV service script with /lib/systemd/systemd-sysv-install.
  Executing: /lib/systemd/systemd-sysv-install disable ssh
  rescue-ssh.target is a disabled or a static unit not running, not starting it.
  Could not execute systemctl:  at /usr/bin/deb-systemd-invoke line 145.
  dpkg: error processing package openssh-server (--configure):
   installed openssh-server package post-installation script subprocess returned error exit status 1
  Processing triggers for man-db (2.11.2-1) ...
  Processing triggers for libc-bin (2.36-0ubuntu4) ...
  Errors were encountered while processing:
   openssh-server
  Error: Timeout was reached
  needrestart is being skipped since dpkg has failed
  E: Sub-process /usr/bin/dpkg returned an error code (1)

  I'm not sure what exactly it is.
  This output complains about rescue-ssh.target and indeed that can not be started even directly.

  $ sudo systemctl start rescue-ssh.target
  A dependency job for rescue-ssh.target failed. See 'journalctl -xe' for details.

  And in postinst is a try to start it:
  $  grep rescue /var/lib/dpkg/info/openssh-server.postinst 
  		deb-systemd-invoke $_dh_action 'rescue-ssh.target' >/dev/null || true

  But I think the underlying issue is that ssh is already on, and I'm logged in via it.
  And that makes the service restart of the ssh socket which was added break.

  Feb 02 10:40:56 node-horsea systemd[104560]: ssh.socket: Failed to create listening socket ([::]:22): Address already in use
  Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed to receive listening socket ([::]:22): Input/output error
  Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed to listen on sockets: Input/output error
  Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed with result 'resources'.

  Now, whichever it is, it is hard to resolve.
  The only way to get the socket to own it would be rebooting so that sshd lets go and systemd can take over.
  I could reboot, but that is not the point.
  What if I'd want to get the service and upgrade completed before reboot.
  Because as of now dpkg considers the system unhappy, and that would usually be a sign for "better not reboot before being resolved" to me.

  One thing though, I have not upgraded with do-release-upgrade - would
  we / do we have magic there to make the ssh socket activation
  transition smoother?

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/2004551/+subscriptions