[Bug 2013543] Re: systemctl daemon-reexec forgets running services and starts everything new
Erik Wasser
2013543 at bugs.launchpad.net
Thu Apr 20 06:27:28 UTC 2023
@enr0n: Thanks for your help.
I've deleted my old comment because I messed up the output of the
container.
Here are the outputs of your test script. Output of the host:
hostsystem ~ # bash /tmp/test.sh
cgroupsv1
Output of the container:
container ~ $ bash /tmp/test.sh
hybrid
So, there is a mismatch of the cgroups versions. What does that mean?
I've attached the output of the `journalctl -t systemd -b 0`. Do you
need an additional output after I've executed the command `systemctl
daemon-reexec`?
Note: Executing the test scripts on a working(!) container with Ubuntu
20.04 and systemd 245.4-4ubuntu3.20 outputs the following:
hostsystem: cgroupsv1
container: hybrid
** Attachment added: "Output of the container with `journalctl -t systemd -b 0` and LogLevel=debug"
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2013543/+attachment/5665411/+files/logs.txt.gz
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/2013543
Title:
systemctl daemon-reexec forgets running services and starts everything
new
Status in systemd package in Ubuntu:
Incomplete
Status in systemd source package in Jammy:
Incomplete
Bug description:
# Our problem #
During a regular update of our container environment, `systemd` (and
the related packages libpam-systemd, libsystemd0, libudev1, systemd-
sysv and udev) were updated from `249.11-0ubuntu3.6` to
`249.11-0ubuntu3.7`. We're talking only about Ubuntu 22.04. Our Ubuntu
20.04 is working fine with `systemctl daemon-reexec`.
In my opinion, the update was not the problem because we've tried
downgrading and tried these versions: (current) `249.11-0ubuntu3.7`,
`249.11-0ubuntu3.6`, `249.11-0ubuntu3.4` and `249.11-0ubuntu3.3`. The
symptoms were the same.
# Symptoms #
The `/var/lib/dpkg/info/systemd.postinst` executes a `systemctl
daemon-reexec` and that ended in a disaster. It seems that `systemd`
is forgetting all it started children and tries to start nearly every
configured service again. Naturally, the old services are still
running, and the ports can't be opened twice and `systemd` won't give
up. Here are some(!) of the logfiles:
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Starting Create Volatile Files and Directories...
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: systemd-udevd.service: Found left-over process 130 (systemd-udevd) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: systemd-udevd.service: Found left-over process 31475 (systemd-udevd) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: systemd-udevd.service: Found left-over process 31476 (systemd-udevd) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
And...
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Reached target System Initialization.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily apt download activities.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily apt upgrade and clean activities.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily dpkg database backup timer.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Periodic ext4 Online Metadata Check for All Filesystems.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Condition check resulted in Discard unused blocks once a week being skipped.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily rotation of log files.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily man-db regeneration.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Message of the Day.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Clean PHP session files every 30 mins.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Update the plocate database daily.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily Cleanup of Temporary Directories.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Reached target Basic System.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: System is tainted: cgroupsv1
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Reached target Timer Units.
And...
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Found left-over process 206 (atd) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Starting Deferred execution scheduler...
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: cron.service: Found left-over process 164 (cron) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Regular background program processing daemon.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: dbus.service: Found left-over process 177 (dbus-daemon) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started D-Bus System Message Bus.
And...
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: rsyslog.service: Found left-over process 204 (rsyslogd) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Failed with result 'exit-code'.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Unit process 206 (atd) remains running after unit stopped.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found left-over process 382 (apache2) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found left-over process 392 (apache2) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found left-over process 397 (apache2) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found left-over process 3052 (apache2) in control group while starting unit. Ignoring.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Starting The Apache HTTP Server...
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Stopped Deferred execution scheduler.
Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Found left-over process 206 (atd) in control group while starting unit. Ignoring.
And...
Mar 31 12:51:40 FQDN_REDACTED sshd[31772]: error: Bind to port 22 on
0.0.0.0 failed: Address already in use.
And...
Mar 31 12:52:06 FQDN_REDACTED systemd[1]: Started The Salt Minion.
Mar 31 12:52:06 FQDN_REDACTED salt-minion[32339]: The Salt Minion is shutdown.
Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Main process exited, code=exited, status=1/FAILURE
Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Failed with result 'exit-code'.
Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Unit process 2808 (/opt/saltstack/) remains running after unit stopped.
Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Unit process 2848 (/opt/saltstack/) remains running after unit stopped.
Other internal `systemd` process were started again:
root 1 0.0 0.1 101204 12444 ? Ss 10:19 0:03 /lib/systemd/systemd -z --system --deserialize 16
root 75 0.0 0.1 31440 13484 ? Ss 10:19 0:00 /lib/systemd/systemd-journald
systemd+ 159 0.0 0.0 16124 8004 ? Ss 10:19 0:00 /lib/systemd/systemd-networkd
message+ 177 0.0 0.0 8252 4440 ? Ss 10:19 0:00 @dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 205 0.0 0.0 14908 6464 ? Ss 10:19 0:00 /lib/systemd/systemd-logind
systemd+ 223 0.0 0.1 25268 12592 ? Ss 10:19 0:00 /lib/systemd/systemd-resolved
root 31424 0.0 0.1 31424 13636 ? Ss 12:51 0:00 /lib/systemd/systemd-journald
systemd+ 31636 0.0 0.0 16124 6588 ? Ss 12:51 0:00 /lib/systemd/systemd-networkd
message+ 31639 0.0 0.0 8124 3804 ? Ss 12:51 0:00 @dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 31682 0.0 0.0 14908 6480 ? Ss 12:51 0:00 /lib/systemd/systemd-logind
systemd+ 31686 0.0 0.1 25268 12580 ? Ss 12:51 0:00 /lib/systemd/systemd-resolved
root 32087 0.0 0.0 21436 5252 ? Ss 12:51 0:00 /lib/systemd/systemd-udevd
You can either kill all the old processes and restart them, and then
everything is fine. Or you can reboot the container. Besides that
`systemctl daemon-reexec` the `systemd` version is running fine.
`systemctl daemon-reload` is working like a charme.
# Normal case #
In the normal case a `systemctl daemon-reexec` just prints only a few
lines:
Mar 31 14:21:58 FQDN_REDACTED systemd[1]: Reexecuting.
Mar 31 14:21:58 FQDN_REDACTED systemd[1]: systemd 249.11-0ubuntu3.7 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Mar 31 14:21:58 FQDN_REDACTED systemd[1]: Detected architecture x86-64.
# Testcase #
Doing a `systemctl daemon-restart` and `ssh localhost` shows the
problem. `systemctl` removes the directory `/run/sshd` during the
reexec and `ssh` will refuse further connects because the directory is
missing.
$ systemctl daemon-restart
$ ssh root at localhost
kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 22
$
Killing the old instance of SSH and restarting it will work.
# Some details to the hardware #
Our metal runs OpenVZ/Virtuozzo with this kernel (without any
problems):
> Linux FQDN_REDACTED 3.10.0-1127.18.2.vz7.163.46 #1 SMP Fri Nov 20
21:47:55 MSK 2020 x86_64 x86_64 x86_64 GNU/Linux
The container with the `systemctl daemon-reexec` problem reports the
following kernel:
Linux FQDN_REDACTED 5.4.0 #1 SMP Thu Apr 22 16:18:59 MSK 2021 x86_64
x86_64 x86_64 GNU/Linux
# Upshot #
* Can somebody help me with this issue?
* Why is `systemctl` losing its internal state about the running processes/services?
* Why is `systemctl` restarting everything?
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2013543/+subscriptions
More information about the foundations-bugs
mailing list