[Bug 1874075] Re: rabbitmq-server startup timeouts differ between SysV and systemd

Mon Jun 29 05:44:18 UTC 2020

Okay, I made test packages for Bionic and Xenial based on the above.

The PPA is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp1874075-test

It contains (based off of -updates):
Xenial:
rabbitmq-server 	3.5.7-1ubuntu0.16.04.2+lp1874075v20200629b1
Bionic:
rabbitmq-server 	3.6.10-1ubuntu0.1+lp1874075v20200629b1 

Debdiffs for the above builds are:
Xenial: https://paste.ubuntu.com/p/Jm8ZctJzny/
Bionic: https://paste.ubuntu.com/p/j6cBPzgWMD/

On Bionic:
After installing the test packages on both nodes, rebooting them, and attempting to reproduce, the node that is trying to rejoin the cluster stays in the systemd "activating" state, and the wrapper script terminates after 5 minutes (300 seconds), i.e. 10 x 30000 ms timeouts. The wrapper script exits with an error code, and systemd restarts the service. This repeats until the node joins the cluster, at which point the systemd status turns active. The problem is fixed.

On Xenial:
After installing the test packages on both nodes, rebooting them, and attempting to reproduce, the node that is trying to rejoin the cluster stays in the systemd "activating" state, and the wrapper script terminates after 60 seconds, which is much shorter than on Bionic. The wrapper script exits with an error code, and systemd restarts the service. This repeats until the node joins the cluster, at which point the systemd status turns active. The problem is fixed.

It seems the timeouts are governed by the
mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout
values, and the -t 600 that we pass into 'rabbitmqctl wait' is ignored.
Nicolas, it seems you are right: if we don't want our services to
restart every 60 (Xenial) or 300 (Bionic) seconds, we would need to
adjust these timeouts. The problem is that we would have to introduce
new configuration files to do this, which is normally frowned upon when
doing an SRU.
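
As a rough illustration of how the two Mnesia settings combine into the
restart interval seen on Bionic, here is a minimal Python sketch. The
function name is hypothetical, and the numbers are the defaults quoted in
the bug description (30,000 ms per retry, 10 retries); it is arithmetic
only, not RabbitMQ code.

```python
def mnesia_wait_seconds(retry_timeout_ms: int, retry_limit: int) -> float:
    """Total time the broker waits for Mnesia tables before giving up.

    Hypothetical helper: multiplies the per-retry timeout (in ms) by the
    number of retries and converts the result to seconds.
    """
    return retry_timeout_ms * retry_limit / 1000


# Defaults quoted in the bug description: 30,000 ms x 10 retries.
print(mnesia_wait_seconds(30_000, 10))  # prints 300.0
```

Lowering either value shortens the window after which the wrapper script
exits and systemd restarts the service.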

Now that we have Restart=on-failure and RestartSec=10, should I add
config to change mnesia_table_loading_retry_timeout? To be honest, I am
happy to leave the values as they are and rely on Restart=on-failure to
do its job. @ddstreet, do you have any strong opinions? Is a service
restarting every 60 seconds until the node can rejoin the cluster
unacceptable?
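
To put a number on "restarting every 60 seconds", here is a simplified
Python model of the restart loop. It assumes each start attempt fails
after a fixed wait window (60 s on Xenial, 300 s on Bionic, as observed
above), that systemd then waits RestartSec=10 before the next attempt,
and that an attempt succeeds if the peer comes back while it is running.
The function name is hypothetical, and process start-up overhead is
ignored.

```python
def failed_attempts_before_join(peer_back_after_s: int,
                                wait_window_s: int,
                                restart_sec: int = 10) -> int:
    """Count start attempts that fail before the peer node returns.

    Simplified model: attempt (wait_window_s) -> failure -> pause
    (restart_sec) -> next attempt, repeated until the peer is back.
    """
    cycle = wait_window_s + restart_sec
    full_cycles, offset = divmod(peer_back_after_s, cycle)
    # If the peer returns during the RestartSec pause, the attempt that
    # just ended has already failed and must be counted as well.
    if offset >= wait_window_s:
        full_cycles += 1
    return full_cycles


# Peer returns after 1000 s:
print(failed_attempts_before_join(1000, 60))   # Xenial-like window
print(failed_attempts_before_join(1000, 300))  # Bionic-like window
```

Under these assumptions the shorter window only means more restarts in
the meantime; in both cases the node joins once the peer is reachable.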

Nicolas, can you install and test these packages and double-check that
you see the same behaviour? If everything is good, you can submit new
debdiffs for Xenial and Bionic based on mine, and we can get some new
builds into -proposed.

Nicolas, I think you were more or less right all along, and all you
were missing was Restart=on-failure and RestartSec=10 in the service
file.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to rabbitmq-server in Ubuntu.
https://bugs.launchpad.net/bugs/1874075

Title:
  rabbitmq-server startup timeouts differ between SysV and systemd

Status in rabbitmq-server package in Ubuntu:
  Fix Released
Status in rabbitmq-server source package in Xenial:
  Fix Committed
Status in rabbitmq-server source package in Bionic:
  Fix Committed
Status in rabbitmq-server source package in Eoan:
  Won't Fix
Status in rabbitmq-server source package in Focal:
  Fix Committed
Status in rabbitmq-server source package in Groovy:
  Fix Released
Status in rabbitmq-server package in Debian:
  New

Bug description:
  The startup timeouts were recently adjusted and synchronized between
  the SysV and systemd startup files.

  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  The new startup files should be included in this package.

  [Impact]

  After starting the RabbitMQ server process, the startup script will
  wait for the server to start by calling `rabbitmqctl wait` and will
  time out after 10 s.

  The startup time of the server depends on how quickly the Mnesia
  database becomes available and the server will time out after
  `mnesia_table_loading_retry_timeout` ms times
  `mnesia_table_loading_retry_limit` retries. By default this wait is
  30,000 ms times 10 retries, i.e. 300 s.

  The mismatch between these two timeout values might lead to the
  startup script failing prematurely while the server is still waiting
  for the Mnesia tables.

  This change introduces variable `RABBITMQ_STARTUP_TIMEOUT` and the
  `--timeout` option into the startup script. The default value for this
  timeout is set to 10 minutes (600 seconds).
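
  The intent of the new variable can be sketched in Python as follows.
  This is not the upstream startup script (which is a shell script); the
  function names are hypothetical, and only `RABBITMQ_STARTUP_TIMEOUT`,
  the 600 s default, and the 30,000 ms x 10 retries figures come from
  this report.

```python
import os

DEFAULT_STARTUP_TIMEOUT_S = 600  # 10 minutes, as stated in this report


def startup_timeout_s(env=None) -> int:
    """Return RABBITMQ_STARTUP_TIMEOUT in seconds, or the default if unset."""
    env = os.environ if env is None else env
    raw = env.get("RABBITMQ_STARTUP_TIMEOUT", "")
    return int(raw) if raw.strip() else DEFAULT_STARTUP_TIMEOUT_S


def script_outlasts_server(script_timeout_s: int,
                           retry_timeout_ms: int = 30_000,
                           retry_limit: int = 10) -> bool:
    """True if the script waits at least as long as the server's Mnesia wait."""
    return script_timeout_s * 1000 >= retry_timeout_ms * retry_limit


print(script_outlasts_server(10))                     # old 10 s timeout
print(script_outlasts_server(startup_timeout_s({})))  # new 600 s default
```

  With the old 10 s value the script gives up long before the server
  does; with the 600 s default it no longer does.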

  This change also updates the systemd service file to match the timeout
  values between the two service management methods.

  [Scope]

  Upstream patch:
  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  * Fix is not included in the Debian package
  * Fix is not included in any Ubuntu series

  * Groovy and Focal can apply the upstream patch as is
  * Bionic and Xenial need an additional fix in the systemd service file
    to set the `RABBITMQ_STARTUP_TIMEOUT` variable for the
    `rabbitmq-server-wait` helper script.

  [Test Case]

  In a clustered setup with two nodes, A and B:

  1. create queue on A
  2. shut down B
  3. shut down A
  4. boot B

  The broker on B will wait for A. The systemd service will wait for 10
  seconds and then fail. Boot A and the rabbitmq-server process on B
  will complete startup.

  [Regression Potential]

  This change alters the behavior of the startup scripts when the Mnesia
  database takes long to become available. This might lead to failures
  further down the service dependency chain.
