[Bug 1789177] Re: RabbitMQ fails to synchronize exchanges under high load (Note for ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with fully patched releases)

Tue Apr 6 13:48:30 UTC 2021

@Łukasz, it's a little awkward. Based on Seyeong's testing, the single
patch does not fix the failure to synchronize exchanges under high
load, but it does fix compatibility with releases that have been fully
patched. I've updated the description; hopefully that helps to clear
this up.

** Summary changed:

- RabbitMQ fails to synchronize exchanges under high load
+ RabbitMQ fails to synchronize exchanges under high load (Note for ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with fully patched releases)

https://bugs.launchpad.net/bugs/1789177

Title:
  RabbitMQ fails to synchronize exchanges under high load (Note for
  ubuntu: stein, rocky, queens(bionic) changes only fix compatibility
  with fully patched releases)

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in oslo.messaging:
  Fix Released
Status in python-oslo.messaging package in Ubuntu:
  Fix Released
Status in python-oslo.messaging source package in Xenial:
  In Progress
Status in python-oslo.messaging source package in Bionic:
  Triaged

Bug description:
  [Impact]

  If there are many exchanges and queues, then after a failover
  rabbitmq-server reports errors saying that exchanges cannot be found.

  Affected
   Bionic (Queens)
  Not affected
   Focal

  [Test Case]

  1. deploy simple rabbitmq cluster
  - https://pastebin.ubuntu.com/p/MR76VbMwY5/
  2. juju ssh neutron-gateway/0
  - for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
  3. it is better to add more exchanges, queues, and bindings
  - rabbitmq-plugins enable rabbitmq_management
  - rabbitmqctl add_user test password
  - rabbitmqctl set_user_tags test administrator
  - rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
  - https://pastebin.ubuntu.com/p/brw7rSXD7q/ (save this as create.sh) [1]
  - for i in {1..2000}; do ./create.sh test_$i; done

  4. restart the rabbitmq-server service, or shut the machine down and power it on again, several times.
  5. you should now see the exchange not found error

  [1] create.sh (pasting here because pastebins don't last forever)
  #!/bin/bash

  # $1 is the name used for the exchange, the queue, and the binding between them
  rabbitmqadmin declare exchange -V openstack name="$1" type=direct -u test -p password
  rabbitmqadmin declare queue -V openstack name="$1" durable=false -u test -p password 'arguments={"x-expires":1800000}'
  rabbitmqadmin -V openstack declare binding source="$1" destination_type="queue" destination="$1" routing_key="" -u test -p password
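
  To check step 5 without reading the log by eye, the missing exchange
  names can be pulled out of the rabbit log. This is a minimal sketch,
  assuming the error text has the form shown in the original description
  further down ("not_found: no exchange '<name>'"). The function name
  missing_exchanges is made up for illustration, and the log file
  location is not given in this report, so pass whatever path applies on
  your node.

```shell
# Print the unique exchange names that RabbitMQ reported as missing.
# Reads the log files given as arguments, or stdin if none are given.
missing_exchanges() {
  grep -o -h "not_found: no exchange '[^']*'" "$@" \
    | sed "s/.*no exchange '\([^']*\)'/\1/" \
    | sort -u
}

# Example (hypothetical path):
#   missing_exchanges /path/to/rabbit.log
```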

  [Where problems could occur]
  1. every service that uses oslo.messaging needs to be restarted.
  2. message transfer could be affected

  [Others]

  Possible Workaround

  1. for the exchange not found issue,
  - create the exchange, queue, and binding for the problematic name shown in the log
  - then restart rabbitmq-server on the nodes one by one

  2. for a queue that crashed and failed to restart
  - delete the specific queue named in the log
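
  For workaround 1, the commands can be generated per problematic name.
  This is only a sketch: it prints rabbitmqadmin commands modelled on
  create.sh above rather than running them, and it reuses the test
  vhost, user, and password from the test case. The function name
  recreate_commands is made up for illustration, and the queue
  properties (durable=false, no extra arguments) are an assumption; they
  must match whatever the affected service actually declares.

```shell
# Print (do not run) the rabbitmqadmin commands that recreate the
# exchange, queue, and binding for each name passed as an argument.
# Vhost and credentials are the test values from create.sh; adjust them.
recreate_commands() {
  for name in "$@"; do
    printf "rabbitmqadmin declare exchange -V openstack name='%s' type=direct -u test -p password\n" "$name"
    printf "rabbitmqadmin declare queue -V openstack name='%s' durable=false -u test -p password\n" "$name"
    printf "rabbitmqadmin -V openstack declare binding source='%s' destination_type=queue destination='%s' routing_key='' -u test -p password\n" "$name" "$name"
  done
}

# Example: review the output first, then run the lines by hand.
#   recreate_commands reply_5675d7991b4a4fb7af5d239f4decb19f
```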

  // original description

  Input:
   - OpenStack Pike cluster with ~500 nodes
   - DVR enabled in neutron
   - Lots of messages

  Scenario: failover of one rabbit node in a cluster

  Issue: after the failed rabbit node comes back online, some RPC communication appears broken
  Logs from rabbit:

  =ERROR REPORT==== 10-Aug-2018::17:24:37 ===
  Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
  operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'

  Investigation:
  After the rabbit node comes back online, it immediately receives many new connections and, for some reason, fails to synchronize exchanges: the cluster had ~1600 exchanges, but the count on that node stays low and does not increase.
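
  One way to watch for this is to compare the exchange count each node
  reports. A minimal sketch, assuming rabbitmqctl is available and can
  reach every node; the function name exchange_counts and the node names
  in the example are made up for illustration. Depending on the RabbitMQ
  release, the listing may include a header row, so compare the numbers
  between nodes rather than treating them as exact.

```shell
# Print "<node> <count>" for each node, where count is the number of
# lines rabbitmqctl lists for exchanges in the given vhost.
# Usage: exchange_counts VHOST NODE...
exchange_counts() {
  vhost="$1"
  shift
  for node in "$@"; do
    count=$(rabbitmqctl -q -n "$node" list_exchanges -p "$vhost" name | wc -l | tr -d ' ')
    echo "$node $count"
  done
}

# Example (hypothetical node names):
#   exchange_counts /openstack rabbit@node1 rabbit@node2 rabbit@node3
```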

  Workaround: let the recovered node synchronize all exchanges by using
  iptables rules to forbid new connections for some time (30 sec) after
  the failed node comes back online
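
  The report does not include the actual iptables rules, so the
  following is only a sketch of the idea. It assumes clients connect on
  TCP port 5672 (the port in the log above) and that rejecting new TCP
  connection attempts on the INPUT chain is enough. The names run,
  hold_new_connections, AMQP_PORT, HOLD_SECONDS, and DRY_RUN are made up
  for illustration. The rule should be in place before rabbitmq-server
  starts accepting clients.

```shell
# Reject new AMQP client connections for a while, then allow them again.
# With DRY_RUN=1 the commands are printed instead of executed.
AMQP_PORT="${AMQP_PORT:-5672}"      # assumed client port
HOLD_SECONDS="${HOLD_SECONDS:-30}"  # hold time mentioned in the workaround

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

hold_new_connections() {
  # --syn matches only new connection attempts, not established sessions.
  run iptables -I INPUT -p tcp --dport "$AMQP_PORT" --syn -j REJECT --reject-with tcp-reset
  run sleep "$HOLD_SECONDS"
  run iptables -D INPUT -p tcp --dport "$AMQP_PORT" --syn -j REJECT --reject-with tcp-reset
}

# Example: preview with DRY_RUN=1, then run as root on the recovering node.
#   DRY_RUN=1; hold_new_connections
```

  If the script is interrupted during the sleep, the REJECT rule stays
  in place and has to be removed by hand.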

  Proposal: do not create new exchanges for direct messages; use the
  default exchange for all of them. This also fixes the issue.

  Is there a good reason for creating new exchanges for direct messages?
