[Bug 1874075] Re: rabbitmq-server startup timeouts differ between SysV and systemd

Fri Jun 19 05:50:38 UTC 2020

@ddstreet asked me to take a look.

Step-by-step reproducer:

Set up a 2-node RabbitMQ cluster using virtual machines.
Name the hosts rabbitmq1 and rabbitmq2.
Add both hostnames to /etc/hosts on each VM.

Create the cluster:

1) On both hosts: sudo apt install rabbitmq-server
2) On host 1, copy the string in /var/lib/rabbitmq/.erlang.cookie and place it in /var/lib/rabbitmq/.erlang.cookie on host 2.
3) On host 2, restart the rabbitmq service: sudo systemctl restart rabbitmq-server
4) On host 2, stop the RabbitMQ application (the Erlang node keeps running): sudo rabbitmqctl stop_app
5) On host 2: sudo rabbitmqctl reset
6) On host 2: sudo rabbitmqctl join_cluster rabbit@rabbitmq1
7) On host 2: sudo rabbitmqctl start_app
8) On host 1: sudo rabbitmqctl cluster_status

The output should list both nodes as members of the cluster.

Set up the queues:

On host 1:

1) sudo rabbitmqctl add_user tester linux
2) sudo rabbitmqctl add_vhost tester
3) sudo rabbitmqctl set_permissions -p tester tester ".*" ".*" ".*"
4) sudo rabbitmqctl set_policy -p tester HA ".*" '{"ha-mode": "all"}'
5) sudo rabbitmqctl list_permissions -p tester
6) sudo rabbitmqctl list_policies -p tester

On the VM host (the machine running the VMs):

1) git clone https://github.com/nicolasbock/rabbitmq-test.git && cd rabbitmq-test
2) virtualenv venv
3) . venv/bin/activate
4) pip install -r requirements.txt
5) python setup.py install
6) ./test-rabbit.py <host 1 IP addr> --send 'message 1'
7) ./test-rabbit.py <host 1 IP addr> --list

On host 1:

1) sudo rabbitmqctl list_queues -p tester name pid slave_pids
Listing queues
test_queue	<rabbit@rabbitmq1.1.657.0>	[<rabbit@rabbitmq2.2.979.0>]

In my case, rabbitmq1 is the primary owner of the queue (the pid column),
with rabbitmq2 holding a slave (mirror) copy, listed inside [] in the
slave_pids column.
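
To check which node owns the queue without reading the pids by eye, the
tab-separated output of the list_queues command above can be parsed. This
is only a sketch: parse_queue_line and PID_RE are names made up here, and
it assumes each pid looks like <node.N.N.N> as in the sample line.

```python
import re

# A pid is printed as <node.N.N.N>; the node name itself may contain dots.
PID_RE = re.compile(r"<(.+?)\.\d+\.\d+\.\d+>")


def parse_queue_line(line):
    """Split one 'name pid slave_pids' row into queue, primary and mirrors."""
    name, pid, slave_pids = line.rstrip("\n").split("\t")
    # Some mail archives rewrite '@' as ' at '; undo that before matching.
    pid = pid.replace(" at ", "@")
    slave_pids = slave_pids.replace(" at ", "@")
    match = PID_RE.search(pid)
    if match is None:
        raise ValueError("unrecognised pid field: %r" % pid)
    return {
        "queue": name,
        "primary": match.group(1),
        "mirrors": PID_RE.findall(slave_pids),
    }


if __name__ == "__main__":
    row = "test_queue\t<rabbit@rabbitmq1.1.657.0>\t[<rabbit@rabbitmq2.2.979.0>]"
    print(parse_queue_line(row))
```

An empty mirrors list means no other node holds a copy of the queue.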

We want to shut down the primary host, so the slave gets promoted to
being primary.

Shut down rabbitmq1.

On host 2, confirm it has become primary with:

1) sudo rabbitmqctl list_queues -p tester name pid slave_pids
Listing queues
test_queue	<rabbit@rabbitmq2.2.979.0>	[]

Send a new message, so that the queue on rabbitmq2 moves ahead of the
state rabbitmq1 last knew about.

On the VM host (the machine running the VMs):

1) ./test-rabbit.py <host 2 IP addr> --send 'message 2'
2) ./test-rabbit.py <host 2 IP addr> --list

Shut rabbitmq2 down, so all VMs are off.

Now boot rabbitmq1 only. Because rabbitmq1 was not the last node to shut
down, it assumes it is behind and waits for rabbitmq2 to come online
before it finishes starting. This is where the issue occurs. Check the
status of rabbitmq-server.service on rabbitmq1:

1) sudo systemctl status rabbitmq-server.service
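
The failure at this step comes down to two waits of different length,
both stated in the bug description quoted below: the startup script gives
up after 10 s, while the broker keeps retrying the Mnesia table load for
mnesia_table_loading_retry_timeout times mnesia_table_loading_retry_limit.
A rough Python sketch of that arithmetic follows. It is not code from the
package; the function names and the 120 s delay are made up for
illustration.

```python
def mnesia_wait_s(retry_timeout_ms=30_000, retry_limit=10):
    """Longest time the broker keeps waiting for Mnesia tables, in seconds."""
    return retry_timeout_ms * retry_limit / 1000


def waiter_succeeds(ready_after_s, timeout_s, poll_interval_s=1):
    """Simulate a waiter that polls once per interval; nothing really sleeps."""
    elapsed = 0
    while elapsed <= timeout_s:
        if elapsed >= ready_after_s:
            return True  # the broker came up before the deadline
        elapsed += poll_interval_s
    return False  # deadline passed while the broker was still waiting


if __name__ == "__main__":
    print(mnesia_wait_s())            # 300.0
    # Hypothetical case: the peer node comes back after 120 s.
    print(waiter_succeeds(120, 10))   # False, the old 10 s wait gives up
    print(waiter_succeeds(120, 600))  # True, the new 600 s default covers it
```

Any peer delay between 10 s and 300 s is one the broker would survive but
the old 10 s wait would report as a failed start.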

-- 
https://bugs.launchpad.net/bugs/1874075

Title:
  rabbitmq-server startup timeouts differ between SysV and systemd

Status in rabbitmq-server package in Ubuntu:
  Fix Released
Status in rabbitmq-server source package in Xenial:
  Fix Committed
Status in rabbitmq-server source package in Bionic:
  Fix Committed
Status in rabbitmq-server source package in Eoan:
  Won't Fix
Status in rabbitmq-server source package in Focal:
  Fix Committed
Status in rabbitmq-server source package in Groovy:
  Fix Released
Status in rabbitmq-server package in Debian:
  New

Bug description:
  The startup timeouts were recently adjusted and synchronized between
  the SysV and systemd startup files.

  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  The new startup files should be included in this package.

  [Impact]

  After starting the RabbitMQ server process, the startup script will
  wait for the server to start by calling `rabbitmqctl wait` and will
  time out after 10 s.

  The startup time of the server depends on how quickly the Mnesia
  database becomes available and the server will time out after
  `mnesia_table_loading_retry_timeout` ms times
  `mnesia_table_loading_retry_limit` retries. By default this wait is
  30,000 ms times 10 retries, i.e. 300 s.

  The mismatch between these two timeout values might lead to the
  startup script failing prematurely while the server is still waiting
  for the Mnesia tables.

  This change introduces the variable `RABBITMQ_STARTUP_TIMEOUT` and the
  `--timeout` option into the startup script. The default value for this
  timeout is set to 10 minutes (600 seconds).

  This change also updates the systemd service file to match the timeout
  values between the two service management methods.

  [Scope]

  Upstream patch:
  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  * Fix is not included in the Debian package
  * Fix is not included in any Ubuntu series

  * Groovy and Focal can apply the upstream patch as is
  * Bionic and Xenial need an additional fix in the systemd service file
    to set the `RABBITMQ_STARTUP_TIMEOUT` variable for the
    `rabbitmq-server-wait` helper script.

  [Test Case]

  In a clustered setup with two nodes, A and B:

  1. create queue on A
  2. shut down B
  3. shut down A
  4. boot B

  The broker on B will wait for A. The systemd service will wait for 10
  seconds and then fail. Once A is booted, the rabbitmq-server process
  on B will complete startup.

  [Regression Potential]

  This change alters the behavior of the startup scripts when the Mnesia
  database takes a long time to become available. Services that depend
  on rabbitmq-server may now wait longer for it, which might lead to
  failures further down the service dependency chain.
