[Bug 1789177] Re: RabbitMQ fails to synchronize exchanges under high load

Edward Hope-Morley 1789177 at bugs.launchpad.net
Tue Feb 2 12:55:12 UTC 2021


e.g. 2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(fc9335298407444ab0e7000d3fe2f4b7). Skipping it.
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 368, in _callback
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in __call__
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit     unique_id = self.msg_id_cache.check_duplicate_message(message)
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py", line 121, in check_duplicate_message
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit     raise rpc_common.DuplicateMessageError(msg_id=msg_id)
2021-02-02 12:07:53.930 27349 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(fc9335298407444ab0e7000d3fe2f4b7). Skipping it.


and 

2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-e53cf710-52f8-4790-bb7a-9968807f842f - - - - -] Error while processing VIF ports: MessagingTimeout: Timed out waiting for a reply to message ID 06bc2386bc6b42f2ad48ebc6157b3ec6
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2163, in rpc_loop
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     port_info, provisioning_needed)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158, in wrapper
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = f(*args, **kwargs)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1740, in process_network_ports
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     failed_devices['added'] |= self._bind_devices(need_binding_devices)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 892, in _bind_devices
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.conf.host, agent_restarted=agent_restarted)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 165, in update_device_list
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     agent_restarted=agent_restarted)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 185, in call
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     time.sleep(wait)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.force_reraise()
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     six.reraise(self.type_, self.value, self.tb)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 162, in call
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return self._original_context.call(ctxt, method, **kwargs)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 174, in call
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 131, in _send
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2021-02-02 12:05:54.869 27349 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 06bc2386bc6b42f2ad48ebc6157b3ec6

-- 
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1789177

Title:
  RabbitMQ fails to synchronize exchanges under high load

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive mitaka series:
  New
Status in Ubuntu Cloud Archive queens series:
  Fix Committed
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in oslo.messaging:
  Fix Released
Status in python-oslo.messaging package in Ubuntu:
  Fix Released
Status in python-oslo.messaging source package in Xenial:
  In Progress
Status in python-oslo.messaging source package in Bionic:
  Fix Released

Bug description:
  [Impact]

  If there are many exchanges and queues, then after a failover
  rabbitmq-server logs errors saying that exchanges cannot be found.

  Affected: Bionic (Queens)
  Not affected: Focal


  [Test Case]

  1. Deploy a simple rabbitmq cluster
  - https://pastebin.ubuntu.com/p/MR76VbMwY5/
  2. juju ssh neutron-gateway/0
  - for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
  3. Optionally, add more exchanges, queues and bindings to make the problem easier to reproduce
  - rabbitmq-plugins enable rabbitmq_management 
  - rabbitmqctl add_user test password 
  - rabbitmqctl set_user_tags test administrator
  - rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*" 
  - https://pastebin.ubuntu.com/p/brw7rSXD7q/ (save this as create.sh; see the sketch after these steps)
  - for i in {1..2000}; do ./create.sh test_$i; done

  4. Restart the rabbitmq-server service, or shut the machine down and power it back on, several times.
  5. Observe the "exchange not found" error in the RabbitMQ logs.
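
  The create.sh linked above is not reproduced here; the following is only a
  hypothetical sketch of a script with the same effect, assuming the
  rabbitmqadmin tool from the rabbitmq_management plugin and the test
  user/vhost created in step 3:

    #!/bin/bash
    # create.sh <name> - hypothetical sketch: declare one exchange, one queue
    # and a binding between them, using the credentials created above
    NAME="$1"
    rabbitmqadmin -u test -p password -V openstack declare exchange name="$NAME" type=direct durable=false
    rabbitmqadmin -u test -p password -V openstack declare queue name="${NAME}_q" durable=false
    rabbitmqadmin -u test -p password -V openstack declare binding source="$NAME" destination="${NAME}_q" routing_key="$NAME"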

  
  [Where problems could occur]
  1. Every service that uses oslo.messaging needs to be restarted to pick up the fix.
  2. Message transfer could be affected while those services restart.

  
  [Others]

  // original description

  Input:
   - OpenStack Pike cluster with ~500 nodes
   - DVR enabled in neutron
   - Lots of messages

  Scenario: failover of one rabbit node in a cluster

  Issue: after the failed rabbit node comes back online, some RPC communications appear broken.
  Logs from rabbit:

  =ERROR REPORT==== 10-Aug-2018::17:24:37 ===
  Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
  operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'

  Investigation:
  After the rabbit node comes back online it immediately receives many new connections and, for some reason, fails to synchronize exchanges (the cluster had ~1600 exchanges, while on the recovered node the exchange count stays low and does not increase).
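
  A quick way to see the mismatch (assuming rabbitmqctl access on each cluster
  node and the '/openstack' vhost from the logs above) is to compare the
  exchange count each node reports locally:

    # run on every node; the recovered node reports far fewer exchanges
    rabbitmqctl list_exchanges -p /openstack name | wc -l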

  Workaround: let the recovered node synchronize all exchanges first - block
  new connections with iptables rules for some time (about 30 seconds) after
  the failed node comes back online.
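
  A minimal sketch of that workaround, assuming the standard AMQP port 5672
  and iptables on the recovered node:

    # temporarily reject new client connections while the node resynchronizes
    iptables -I INPUT -p tcp --dport 5672 -j REJECT
    sleep 30
    # remove the rule once the exchanges have synchronized
    iptables -D INPUT -p tcp --dport 5672 -j REJECT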

  Proposal: do not create new exchanges for direct messages (use the default
  exchange instead) - this also fixes the issue.

  Is there a good reason for creating new exchanges for direct messages?

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1789177/+subscriptions


