[Bug 1874075] Re: rabbitmq-server startup timeouts differ between SysV and systemd

Mon Jun 29 03:03:03 UTC 2020

I agree that the fixes in -proposed for Focal and Groovy are good as is,
since they work nicely with Type=notify on those systems.

I spent some time testing on Bionic. I can confirm that ExecStartPost
exists for the very reason Nicolas describes: RabbitMQ returns 0
regardless of whether the server started correctly or not.

When I removed ExecStartPost, and set

TimeoutStartSec=600
Restart=on-failure
RestartSec=10

and reproduced, RabbitMQ went through its 10x 30000ms timeouts and then
exited 0. systemd assumed it was a clean and expected stop, and the
service was stopped with ExitSuccess. The result is that the service
dies after 5 minutes. It's an improvement over 90 seconds, but not quite
the 10 minute gold standard.
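
To illustrate why Restart=on-failure never fired here, below is a minimal
shell sketch of the decision for a plain exit code. The function name is
hypothetical, and it deliberately ignores signals, timeouts and
SuccessExitStatus=, which systemd also takes into account.

```shell
# Hypothetical model of Restart=on-failure for a process that exits on its
# own with a plain exit code (signals and timeouts are not modelled).
restart_on_failure() {
    exit_code=$1
    if [ "$exit_code" -eq 0 ]; then
        # Exit 0 counts as a clean stop, so the unit is left stopped.
        echo "no-restart"
    else
        # A non-zero exit counts as a failure, so the unit is restarted.
        echo "restart"
    fi
}

restart_on_failure 0   # the RabbitMQ case described above
restart_on_failure 1   # a helper script that exits with failure
```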

I then added ExecStartPost back in, and applied the modification that
Nicolas put forward to the wrapper script,
/usr/lib/rabbitmq/bin/rabbitmq-server-wait:

/usr/lib/rabbitmq/bin/rabbitmqctl wait -t 600 $RABBITMQ_PID_FILE

I then reproduced again. systemd no longer treats the service as started
until it actually joins the cluster. Instead, it stays in the activating
state while it is waiting for the cluster leader to turn up.
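
For reference, the shape of such a helper is roughly as follows. This is
only a sketch, not the packaged script: the function name and the
RABBITMQCTL override are made up for illustration, while the rabbitmqctl
path and the -t 600 value are the ones quoted above.

```shell
#!/bin/sh
# Sketch of a wait helper. RABBITMQCTL is a hypothetical override that makes
# the function easy to try out with a stub command.
RABBITMQCTL="${RABBITMQCTL:-/usr/lib/rabbitmq/bin/rabbitmqctl}"

wait_for_broker() {
    pid_file=$1
    timeout=${2:-600}   # same value as in the command above
    # The exit status of rabbitmqctl becomes the exit status of the helper,
    # so a failed wait is reported to whoever called it.
    "$RABBITMQCTL" wait -t "$timeout" "$pid_file"
}
```

Called as wait_for_broker "$RABBITMQ_PID_FILE", a non-zero status from
rabbitmqctl is passed straight through to the caller.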

I left the VM in this activating state for quite some time. Each time
the 10x 30000ms timeouts finish, the wrapper script exits with failure,
systemd restarts the service, and we go back to another 10x 30000ms
cycle. From my testing it does not stop after 2 rounds; it keeps going
indefinitely, likely due to Restart=on-failure and RestartSec=10 being
set.

This works wonderfully, and fixes the problem. I'm sorry I ever doubted
your solution, Nicolas.

While I do think the better solution is to set Type=notify, on Bionic it
would require the socat dependency to actually send the notification to
systemd. As Dan mentioned before, that is unacceptable for an SRU, and
the apt upgrade reasoning makes sense. We can enjoy perfect
systemd-controlled resilience on Focal onwards.

On Bionic, I think staying with Type=simple is fine, and we just need to
set

TimeoutStartSec=600
Restart=on-failure
RestartSec=10

and make the following change to
/usr/lib/rabbitmq/bin/rabbitmq-server-wait:

/usr/lib/rabbitmq/bin/rabbitmqctl wait -t 600 $RABBITMQ_PID_FILE
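
For anyone who wants to try the unit settings on an existing Bionic
install, a drop-in override is one way to do it. The sketch below only
writes the file. The function name and the use of a drop-in (rather than
editing the packaged unit) are assumptions for illustration, and the
wrapper script change still has to be made separately.

```shell
# Sketch: write the three settings as a systemd drop-in into the given
# directory. Hypothetical helper, not part of the package.
write_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
TimeoutStartSec=600
Restart=on-failure
RestartSec=10
EOF
}
```

On a real system the directory would normally be
/etc/systemd/system/rabbitmq-server.service.d, followed by
systemctl daemon-reload.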

I don't think we need to change the default
mnesia_table_loading_retry_limit or mnesia_table_loading_retry_timeout
values. With Restart=on-failure set in systemd, once the wrapper script
dies at the 5 minute timeout, the service will just be restarted and we
begin anew.
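
The 5 minute figure comes straight from the two Mnesia defaults. As a
quick sanity check (variable names are mine, values are the defaults and
settings quoted in this bug):

```shell
retry_timeout_ms=30000    # default mnesia_table_loading_retry_timeout
retry_limit=10            # default mnesia_table_loading_retry_limit
timeout_start_sec=600     # TimeoutStartSec proposed above
restart_sec=10            # RestartSec proposed above

# One full round of Mnesia retries, in seconds.
round_s=$(( retry_timeout_ms * retry_limit / 1000 ))
# Approximate length of one restart cycle: a round plus the restart delay.
cycle_s=$(( round_s + restart_sec ))

echo "one retry round: ${round_s}s"
echo "one restart cycle: about ${cycle_s}s"
if [ "$round_s" -lt "$timeout_start_sec" ]; then
    echo "RabbitMQ gives up before TimeoutStartSec expires"
fi
```

Since one round is shorter than TimeoutStartSec, each restart is
triggered by RabbitMQ giving up and the wrapper failing, not by systemd's
own start timeout.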

I'll make a test package now, and see how it goes.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to rabbitmq-server in Ubuntu.
https://bugs.launchpad.net/bugs/1874075

Title:
  rabbitmq-server startup timeouts differ between SysV and systemd

Status in rabbitmq-server package in Ubuntu:
  Fix Released
Status in rabbitmq-server source package in Xenial:
  Fix Committed
Status in rabbitmq-server source package in Bionic:
  Fix Committed
Status in rabbitmq-server source package in Eoan:
  Won't Fix
Status in rabbitmq-server source package in Focal:
  Fix Committed
Status in rabbitmq-server source package in Groovy:
  Fix Released
Status in rabbitmq-server package in Debian:
  New

Bug description:
  The startup timeouts were recently adjusted and synchronized between
  the SysV and systemd startup files.

  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  The new startup files should be included in this package.

  [Impact]

  After starting the RabbitMQ server process, the startup script will
  wait for the server to start by calling `rabbitmqctl wait` and will
  time out after 10 s.

  The startup time of the server depends on how quickly the Mnesia
  database becomes available and the server will time out after
  `mnesia_table_loading_retry_timeout` ms times
  `mnesia_table_loading_retry_limit` retries. By default this wait is
  30,000 ms times 10 retries, i.e. 300 s.

  The mismatch between these two timeout values might lead to the
  startup script failing prematurely while the server is still waiting
  for the Mnesia tables.

  This change introduces variable `RABBITMQ_STARTUP_TIMEOUT` and the
  `--timeout` option into the startup script. The default value for this
  timeout is set to 10 minutes (600 seconds).
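
  As an illustration of the behaviour described above, a startup script
  could resolve the timeout as in the sketch below. The function name and
  the precedence (the `--timeout` option over `RABBITMQ_STARTUP_TIMEOUT`
  over the 600 second default) are assumptions for this example, not
  taken from the upstream patch.

```shell
# Sketch only: resolve the startup timeout in seconds.
# Assumed precedence: --timeout option, then RABBITMQ_STARTUP_TIMEOUT,
# then the 600 second default.
resolve_startup_timeout() {
    timeout="${RABBITMQ_STARTUP_TIMEOUT:-600}"
    while [ $# -gt 0 ]; do
        case "$1" in
            --timeout)
                if [ $# -ge 2 ]; then
                    timeout="$2"
                    shift 2
                else
                    # Option given without a value: keep the current value.
                    shift
                fi
                ;;
            *)
                # Any other argument is ignored by this sketch.
                shift
                ;;
        esac
    done
    echo "$timeout"
}
```

  With nothing set, resolve_startup_timeout prints the default, and
  resolve_startup_timeout --timeout 120 prints the value given on the
  command line.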

  This change also updates the systemd service file to match the timeout
  values between the two service management methods.

  [Scope]

  Upstream patch:
  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  * Fix is not included in the Debian package
  * Fix is not included in any Ubuntu series

  * Groovy and Focal can apply the upstream patch as is
  * Bionic and Xenial need an additional fix in the systemd service file
    to set the `RABBITMQ_STARTUP_TIMEOUT` variable for the
    `rabbitmq-server-wait` helper script.

  [Test Case]

  In a clustered setup with two nodes, A and B:

  1. create queue on A
  2. shut down B
  3. shut down A
  4. boot B

  The broker on B will wait for A. The systemd service will wait for 10
  seconds and then fail. Boot A and the rabbitmq-server process on B
  will complete startup.

  [Regression Potential]

  This change alters the behavior of the startup scripts when the Mnesia
  database takes a long time to become available. This might lead to
  failures further down the service dependency chain.
