[Bug 1789177] Re: RabbitMQ fails to synchronize exchanges under high load (Note for ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with fully patched releases)
Łukasz Zemczak
1789177 at bugs.launchpad.net
Mon Jun 7 14:28:19 UTC 2021
Hello Oleg, or anyone else affected,
Accepted python-oslo.messaging into bionic-proposed. The package will
build now and be available at
https://launchpad.net/ubuntu/+source/python-oslo.messaging/5.35.0-0ubuntu4
in a few hours, and then in the -proposed repository.
Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us in getting this
update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from
verification-needed-bionic to verification-done-bionic. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-bionic. In either case, without details of
your testing we will not be able to proceed.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance for helping!
N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.
** Changed in: python-oslo.messaging (Ubuntu Bionic)
Status: Triaged => Fix Committed
** Tags removed: verification-done-bionic
** Tags added: verification-needed verification-needed-bionic
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1789177
Title:
RabbitMQ fails to synchronize exchanges under high load (Note for
ubuntu: stein, rocky, queens(bionic) changes only fix compatibility
with fully patched releases)
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive mitaka series:
Triaged
Status in Ubuntu Cloud Archive queens series:
In Progress
Status in Ubuntu Cloud Archive rocky series:
Fix Released
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in oslo.messaging:
Fix Released
Status in python-oslo.messaging package in Ubuntu:
Fix Released
Status in python-oslo.messaging source package in Xenial:
In Progress
Status in python-oslo.messaging source package in Bionic:
Fix Committed
Bug description:
[Impact]
If there are many exchanges and queues, then after a failover
rabbitmq-server reports errors that exchanges cannot be found.
Affected
Bionic (Queens)
Not affected
Focal
[Test Case]
1. deploy simple rabbitmq cluster
- https://pastebin.ubuntu.com/p/MR76VbMwY5/
2. juju ssh neutron-gateway/0
- for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
3. ideally, add more exchanges, queues, and bindings
- rabbitmq-plugins enable rabbitmq_management
- rabbitmqctl add_user test password
- rabbitmqctl set_user_tags test administrator
- rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
- https://pastebin.ubuntu.com/p/brw7rSXD7q/ (save this as create.sh) [1]
- for i in {1..2000}; do ./create.sh test_$i; done
4. restart the rabbitmq-server service, or shut the machine down and power it back on, several times.
5. the "exchange not found" error should then appear
[1] create.sh (pasting here because pastebins don't last forever)
#!/bin/bash
# Declare a direct exchange, a non-durable queue (x-expires: 30 minutes),
# and a binding between them, all named after the first argument.
rabbitmqadmin declare exchange -V openstack name="$1" type=direct -u test -p password
rabbitmqadmin declare queue -V openstack name="$1" durable=false -u test -p password 'arguments={"x-expires":1800000}'
rabbitmqadmin -V openstack declare binding source="$1" destination_type="queue" destination="$1" routing_key="" -u test -p password
[Where problems could occur]
1. Every service that uses oslo.messaging needs to be restarted.
2. Message transfer could be affected.
[Others]
Possible Workaround
1. for the "exchange not found" issue,
- create the exchange, queue, and binding for the problematic name shown in the log
- then restart the rabbitmq-server nodes one by one
2. for a queue that crashed and failed to restart
- delete the specific queue named in the log
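To find the problematic names for the first workaround, the exchange names can be pulled out of the RabbitMQ log. This is only a sketch: it assumes the log lines look like the "no exchange '...' in vhost" error quoted in this bug, and the function name missing_exchanges is made up for illustration.

```shell
#!/bin/bash
# Sketch only: read RabbitMQ log text on stdin and print each exchange
# name reported as missing, once per name.
# Assumes lines contain: no exchange '<name>' in vhost '<vhost>'
missing_exchanges() {
    grep -o "no exchange '[^']*'" \
        | sed "s/^no exchange '//; s/'\$//" \
        | sort -u
}
```

Example use, with the log file path left as a placeholder since it depends on the deployment: missing_exchanges < /path/to/rabbitmq.log. Each printed name could then be passed to create.sh.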
// original description
Input:
- OpenStack Pike cluster with ~500 nodes
- DVR enabled in neutron
- Lots of messages
Scenario: failover of one rabbit node in a cluster
Issue: after the failed rabbit node comes back online, some RPC communications appear broken
Logs from rabbit:
=ERROR REPORT==== 10-Aug-2018::17:24:37 ===
Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'
Investigation:
After the rabbit node comes back online, it immediately receives many new connections and, for some reason, fails to synchronize exchanges (the cluster had ~1600 exchanges). The exchange count on that node stays low and does not increase.
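One way to watch the exchange count on the recovered node is to count the names that rabbitmqctl lists. This is a rough sketch, not the reporter's method: the helper name count_exchanges is hypothetical, and the vhost name has to match the deployment.

```shell
#!/bin/bash
# Sketch only: count exchange names given one per line on stdin.
# Blank lines are skipped, so the nameless default exchange is not counted.
count_exchanges() {
    grep -c -v '^[[:space:]]*$'
}
```

Example use on the node: rabbitmqctl -q list_exchanges -p /openstack name | count_exchanges. Running it repeatedly after the node rejoins shows whether the number climbs toward the cluster-wide total.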
Workaround: let the recovered node synchronize all exchanges - forbid
new connections with iptables rules for some time (30 sec) after the
failed node comes back online
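The iptables workaround could look roughly like the sketch below. The exact rules used by the reporter are not given in this bug, so everything here is an assumption: the port 5672 is taken from the log excerpt above, the conntrack match on NEW connections is one possible way to leave established traffic alone, and AMQP_PORT, HOLD_SECONDS, DRY_RUN, run and hold_new_connections are names made up for illustration.

```shell
#!/bin/bash
# Sketch only: reject NEW inbound AMQP connections for a short period so a
# recovered node can synchronize before clients reconnect.
# All names and defaults below are assumptions, not from the bug report.
AMQP_PORT="${AMQP_PORT:-5672}"
HOLD_SECONDS="${HOLD_SECONDS:-30}"

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

hold_new_connections() {
    # Insert the reject rule, wait, then delete the same rule again.
    run iptables -I INPUT -p tcp --dport "$AMQP_PORT" -m conntrack --ctstate NEW -j REJECT
    run sleep "$HOLD_SECONDS"
    run iptables -D INPUT -p tcp --dport "$AMQP_PORT" -m conntrack --ctstate NEW -j REJECT
}
```

Setting DRY_RUN=1 before calling hold_new_connections prints the three commands without touching the firewall. For a real run the function needs root, and the rule would have to be in place before rabbitmq-server starts accepting clients on the recovered node.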
Proposal: do not create new exchanges (use default) for all direct
messages - this also fixes the issue.
Is there a good reason for creating new exchanges for direct messages?
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1789177/+subscriptions