[Bug 1789177] Re: RabbitMQ fails to synchronize exchanges under high load

Seyeong Kim 1789177 at bugs.launchpad.net
Tue Mar 23 06:56:06 UTC 2021


** Description changed:

  [Impact]
  
  If there are many exchanges and queues, then after a failover the
  rabbitmq-server logs report that exchanges cannot be found.
  
  Affected
   Bionic (Queens)
  Not affected
   Focal
  
  [Test Case]
  
  1. deploy a simple rabbitmq cluster
  - https://pastebin.ubuntu.com/p/MR76VbMwY5/
  2. juju ssh neutron-gateway/0
  - for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
  3. add more exchanges, queues, and bindings to make the failure easier to reproduce
  - rabbitmq-plugins enable rabbitmq_management
  - rabbitmqctl add_user test password
  - rabbitmqctl set_user_tags test administrator
  - rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
  - https://pastebin.ubuntu.com/p/brw7rSXD7q/ (save this as create.sh) [1]
  - for i in {1..2000}; do ./create.sh test_$i; done
  
  4. restart the rabbitmq-server service, or shut down and power the machine back on, several times.
  5. the 'exchange not found' error appears in the rabbitmq logs
  
- 
  [1] create.sh (pasting here because pastebins don't last forever)
  #!/bin/bash
  
  rabbitmqadmin declare exchange -V openstack name=$1 type=direct -u test -p password
  rabbitmqadmin declare queue -V openstack name=$1 durable=false -u test -p password 'arguments={"x-expires":1800000}'
  rabbitmqadmin -V openstack declare binding source=$1 destination_type="queue" destination=$1 routing_key="" -u test -p password
  
- 
  [Where problems could occur]
  1. every service that uses oslo.messaging needs to be restarted.
  2. Message transfer could be affected
  
  [Others]
+ 
+ Possible Workaround
+ 
+ 1. for the "exchange not found" issue:
+ - create the exchange, queue, and binding for the problematic name reported in the log
+ - then restart rabbitmq-server on each node, one by one
+ 
+ 2. for a queue that crashed and fails to restart:
+ - delete the specific queue reported in the log
+ 
  
  // original description
  
  Input:
   - OpenStack Pike cluster with ~500 nodes
   - DVR enabled in neutron
   - Lots of messages
  
  Scenario: failover of one rabbit node in a cluster
  
  Issue: after the failed rabbit node gets back online, some RPC communications appear broken
  Logs from rabbit:
  
  =ERROR REPORT==== 10-Aug-2018::17:24:37 ===
  Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
  operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'
  
  Investigation:
  After the failed rabbit node gets back online it immediately receives many new connections and, for some reason, fails to synchronize exchanges (the cluster had ~1600 exchanges; on the recovered node the count stays low and does not increase).
  
  Workaround: let the recovered node synchronize all exchanges first - block
  new connections with iptables rules for some time (about 30 seconds) after
  the failed node comes back online.
  
  Proposal: do not create new exchanges (use default) for all direct
  messages - this also fixes the issue.
  
  Is there a good reason for creating new exchanges for direct messages?

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1789177

Title:
  RabbitMQ fails to synchronize exchanges under high load

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in Ubuntu Cloud Archive rocky series:
  Fix Committed
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in oslo.messaging:
  Fix Released
Status in python-oslo.messaging package in Ubuntu:
  Fix Released
Status in python-oslo.messaging source package in Xenial:
  In Progress
Status in python-oslo.messaging source package in Bionic:
  Triaged

Bug description:
  [Impact]

  If there are many exchanges and queues, then after a failover the
  rabbitmq-server logs report that exchanges cannot be found.

  Affected
   Bionic (Queens)
  Not affected
   Focal

  [Test Case]

  1. deploy a simple rabbitmq cluster
  - https://pastebin.ubuntu.com/p/MR76VbMwY5/
  2. juju ssh neutron-gateway/0
  - for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
  3. add more exchanges, queues, and bindings to make the failure easier to reproduce
  - rabbitmq-plugins enable rabbitmq_management
  - rabbitmqctl add_user test password
  - rabbitmqctl set_user_tags test administrator
  - rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
  - https://pastebin.ubuntu.com/p/brw7rSXD7q/ (save this as create.sh) [1]
  - for i in {1..2000}; do ./create.sh test_$i; done

  4. restart the rabbitmq-server service, or shut down and power the machine back on, several times.
  5. the 'exchange not found' error appears in the rabbitmq logs
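
  A minimal sketch of steps 4 and 5, assuming rabbitmq-server runs as a
  systemd unit and logs under /var/log/rabbitmq/ (the restart count, sleep
  and log path are illustrative):

  # Restart rabbitmq-server a few times, then look for the channel
  # exception reported in this bug.
  for i in {1..5}; do
      systemctl restart rabbitmq-server
      sleep 60    # give the node time to rejoin the cluster
  done
  grep "not_found: no exchange" /var/log/rabbitmq/*.log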

  [1] create.sh (pasting here because pastebins don't last forever)
  #!/bin/bash
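  # Declare a direct exchange, a non-durable queue that is removed after 30
  # minutes of idleness (x-expires=1800000 ms), and a binding between them,
  # all named after the first script argument, in the openstack vhost.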

  rabbitmqadmin declare exchange -V openstack name=$1 type=direct -u test -p password
  rabbitmqadmin declare queue -V openstack name=$1 durable=false -u test -p password 'arguments={"x-expires":1800000}'
  rabbitmqadmin -V openstack declare binding source=$1 destination_type="queue" destination=$1 routing_key="" -u test -p password

  [Where problems could occur]
  1. every service that uses oslo.messaging needs to be restarted.
  2. Message transfer could be affected

  [Others]

  Possible Workaround

  1. for the "exchange not found" issue:
  - create the exchange, queue, and binding for the problematic name reported in the log (see the sketch after this list)
  - then restart rabbitmq-server on each node, one by one

  2. for a queue that crashed and fails to restart:
  - delete the specific queue reported in the log
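
  A minimal sketch of both workarounds with rabbitmqadmin, reusing the test
  user from the test case above; the object name is illustrative and should
  be taken from the "no exchange ..." / crashed-queue lines in the log:

  #!/bin/bash
  ADMIN="rabbitmqadmin -V openstack -u test -p password"
  NAME=reply_5675d7991b4a4fb7af5d239f4decb19f   # problematic name from the log

  # 1. re-create the missing exchange, queue and binding, then restart
  #    rabbitmq-server on every node, one node at a time
  $ADMIN declare exchange name=$NAME type=direct
  $ADMIN declare queue name=$NAME durable=false
  $ADMIN declare binding source=$NAME destination_type=queue destination=$NAME routing_key=""
  systemctl restart rabbitmq-server

  # 2. a queue that crashed and fails to restart can be deleted instead
  $ADMIN delete queue name=$NAME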

  
  // original description

  Input:
   - OpenStack Pike cluster with ~500 nodes
   - DVR enabled in neutron
   - Lots of messages

  Scenario: failover of one rabbit node in a cluster

  Issue: after the failed rabbit node gets back online, some RPC communications appear broken
  Logs from rabbit:

  =ERROR REPORT==== 10-Aug-2018::17:24:37 ===
  Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
  operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'

  Investigation:
  After the failed rabbit node gets back online it immediately receives many new connections and, for some reason, fails to synchronize exchanges (the cluster had ~1600 exchanges; on the recovered node the count stays low and does not increase).

  Workaround: let the recovered node synchronize all exchanges first - block
  new connections with iptables rules for some time (about 30 seconds) after
  the failed node comes back online.
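
  A hedged sketch of that workaround, run on the recovering node just after
  rabbitmq-server starts (the exact iptables rule is an assumption; port
  5672 and the 30-second window come from the report above):

  # Reject new AMQP connections while the node catches up, then allow them.
  iptables -I INPUT -p tcp --dport 5672 --syn -j REJECT
  sleep 30    # let the recovered node synchronize its exchanges
  iptables -D INPUT -p tcp --dport 5672 --syn -j REJECT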

  Proposal: do not create new exchanges (use default) for all direct
  messages - this also fixes the issue.
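
  For illustration, a message published through the default exchange is
  routed by queue name alone, so no per-reply exchange has to exist (or be
  re-synchronized) on a recovered node; a minimal sketch with rabbitmqadmin,
  reusing the test user from the test case (the queue name is illustrative):

  # Only the queue is declared; amq.default routes on the routing key as
  # the queue name, so there is no reply_* exchange to lose on failover.
  rabbitmqadmin -V openstack -u test -p password declare queue name=reply_demo durable=false
  rabbitmqadmin -V openstack -u test -p password publish exchange=amq.default routing_key=reply_demo payload=test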

  Is there a good reason for creating new exchanges for direct messages?

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1789177/+subscriptions


