[Bug 1789177] Re: RabbitMQ fails to synchronize exchanges under high load (Note for ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with fully patched releases)
OpenStack Infra
1789177 at bugs.launchpad.net
Fri Jul 8 13:49:14 UTC 2022
Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/749193
Committed: https://opendev.org/openstack/oslo.messaging/commit/b2acc6663f6c3f60e07cdeb1eae97fd1210a4d81
Submitter: "Zuul (22348)"
Branch: stable/stein
commit b2acc6663f6c3f60e07cdeb1eae97fd1210a4d81
Author: shenjiatong <yshxxsjt715 at gmail.com>
Date: Fri Jul 3 15:51:21 2020 +0800
Cancel consumer if queue down
Previously, we switched to using the default exchange
to avoid excessive "exchange not found" messages.
But that does not actually solve the problem, because the
reply_* queue is already gone and the agent will not receive callbacks.
After some debugging, I found that under some circumstances
the rabbitmq consumer does not seem to receive a basic.cancel
signal when the queue is already gone. This might be due to
rabbitmq trying to restart the consumer when the queue is down
(for example, during a split brain). In such cases,
it might be better to fail early.
From reading the code, it seems that x-cancel-on-ha-failover
is not dedicated to mirrored queues only: https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1894,
https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1926.
By failing early, in my own test setup,
I could solve a certain case of the exchange not found problem.
Change-Id: I2ae53340783e4044dab58035bc0992dc08145b53
Related-bug: #1789177
Depends-On: https://review.opendev.org/#/c/747892/
(cherry picked from commit 196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1)
(cherry picked from commit 0a432c7fb107d04f7a41199fe9a8c4fbd344d009)
(cherry picked from commit 5de11fa752ab8e37b95b1785f4c71210bf473f0c)
** Tags added: in-stable-stein
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1789177
Title:
RabbitMQ fails to synchronize exchanges under high load (Note for
ubuntu: stein, rocky, queens(bionic) changes only fix compatibility
with fully patched releases)
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive mitaka series:
Triaged
Status in Ubuntu Cloud Archive queens series:
Fix Released
Status in Ubuntu Cloud Archive rocky series:
Fix Released
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in oslo.messaging:
Fix Released
Status in python-oslo.messaging package in Ubuntu:
Fix Released
Status in python-oslo.messaging source package in Xenial:
Invalid
Status in python-oslo.messaging source package in Bionic:
Fix Released
Bug description:
[Impact]
If there are many exchanges and queues, rabbitmq-server reports
after a failover that exchanges cannot be found.
Affected
Bionic (Queens)
Not affected
Focal
[Test Case]
1. deploy simple rabbitmq cluster
- https://pastebin.ubuntu.com/p/MR76VbMwY5/
2. juju ssh neutron-gateway/0
- for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
3. optionally, add more exchanges, queues and bindings
- rabbitmq-plugins enable rabbitmq_management
- rabbitmqctl add_user test password
- rabbitmqctl set_user_tags test administrator
- rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
- https://pastebin.ubuntu.com/p/brw7rSXD7q/ ( save this as create.sh) [1]
- for i in {1..2000}; do ./create.sh test_$i; done
4. restart the rabbitmq-server service, or shut the machine down and power it back on, several times.
5. the exchange not found error then shows up
[1] create.sh (pasting here because pastebins don't last forever)
#!/bin/bash
# Usage: ./create.sh NAME
# Declares an exchange and a queue called NAME in the openstack vhost and binds them.
rabbitmqadmin declare exchange -V openstack name="$1" type=direct -u test -p password
rabbitmqadmin declare queue -V openstack name="$1" durable=false -u test -p password 'arguments={"x-expires":1800000}'
rabbitmqadmin -V openstack declare binding source="$1" destination_type="queue" destination="$1" routing_key="" -u test -p password
[Where problems could occur]
1. every service which uses oslo.messaging needs to be restarted.
2. Message delivery could be affected
[Others]
Possible Workaround
1. for the exchange not found issue,
- create the exchange, queue and binding for the problematic name shown in the log
- then restart the rabbitmq-server nodes one by one
2. for a queue that crashed and failed to restart
- delete the specific queue named in the log
// original description
Input:
- OpenStack Pike cluster with ~500 nodes
- DVR enabled in neutron
- Lots of messages
Scenario: failover of one rabbit node in a cluster
Issue: after the failed rabbit node comes back online, some RPC communication appears broken
Logs from rabbit:
=ERROR REPORT==== 10-Aug-2018::17:24:37 ===
Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'
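To collect the affected exchange names from log lines like the one above, something along these lines may help. This is only a sketch: the function name missing_exchanges is made up for illustration, and it assumes the log wording matches the sample exactly.

```shell
#!/bin/bash
# Hypothetical helper (not part of RabbitMQ or oslo.messaging).
# Prints the unique exchange names found in
# "not_found: no exchange '<name>' in vhost" log lines.
# Reads the log files given as arguments, or stdin if none are given.
missing_exchanges() {
    sed -n "s/.*not_found: no exchange '\([^']*\)' in vhost.*/\1/p" "$@" | sort -u
}
```

For example, "missing_exchanges /path/to/rabbit.log" would print one name per line; the log path depends on the deployment.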
Investigation:
After the rabbit node comes back online, it immediately receives many new connections and for some reason fails to synchronize exchanges (the cluster had ~1600 exchanges); the exchange count on that node stays low and does not increase.
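One way to watch the exchange count on the recovered node is to poll "rabbitmqctl list_exchanges". The sketch below assumes rabbitmqctl is on the PATH and is run on that node; the function names and the RABBITMQCTL override variable are hypothetical. Depending on the RabbitMQ version, the output may include a header line, so treat the number as approximate and compare it with the same count taken on a healthy node.

```shell
#!/bin/bash
# Command used to query the broker; overridable (assumption, for testing).
RABBITMQCTL=${RABBITMQCTL:-rabbitmqctl}

# Print the number of lines list_exchanges reports for the given vhost.
exchange_count() {
    local vhost=$1
    "$RABBITMQCTL" -q list_exchanges -p "$vhost" name | wc -l | tr -d '[:space:]'
}

# Poll once per second until the count reaches the expected value.
# Returns 0 on success, 1 if the count is still too low after all tries.
wait_for_exchanges() {
    local vhost=$1 expected=$2 tries=${3:-30} count
    while [ "$tries" -gt 0 ]; do
        count=$(exchange_count "$vhost")
        if [ "$count" -ge "$expected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For instance, "wait_for_exchanges /openstack 1600 60" would wait up to about a minute for roughly 1600 exchanges to appear in the /openstack vhost.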
Workaround: let the recovered node synchronize all exchanges - forbid
new connections with iptables rules for some time (30 sec) after the
failed node comes back online
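The iptables part of this workaround could look roughly like the sketch below. The exact rules used in the original deployment are not given, so this is an assumption: it rejects only new TCP connections (SYN packets) to the AMQP client port 5672 seen in the log above, leaving cluster traffic on other ports alone. The function names and the IPTABLES override variable are hypothetical, and the commands need root.

```shell
#!/bin/bash
# Command used to edit firewall rules; overridable (assumption, for testing).
IPTABLES=${IPTABLES:-iptables}

# Reject new client connections to the given AMQP port (default 5672).
block_clients() {
    local port=${1:-5672}
    "$IPTABLES" -I INPUT -p tcp --dport "$port" --syn -j REJECT --reject-with tcp-reset
}

# Remove the rule added by block_clients.
unblock_clients() {
    local port=${1:-5672}
    "$IPTABLES" -D INPUT -p tcp --dport "$port" --syn -j REJECT --reject-with tcp-reset
}
```

A possible sequence on the recovering node: "block_clients; systemctl start rabbitmq-server; sleep 30; unblock_clients".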
Proposal: do not create new exchanges (use default) for all direct
messages - this also fixes the issue.
Is there a good reason for creating new exchanges for direct messages?
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1789177/+subscriptions