[Bug 1318721] [NEW] RPC timeout in all neutron agents

Launchpad Bug Tracker 1318721 at bugs.launchpad.net
Thu Dec 17 11:54:05 UTC 2015


You have been subscribed to a public bug:

In the logs, the first traceback that happens is this:

[-] Unexpected exception occurred 1 time(s)... retrying.
Traceback (most recent call last):
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
    return infunc(*args, **kwargs)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 741, in _consumer_thread

  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 732, in consume
    @excutils.forever_retry_uncaught_exceptions
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 660, in iterconsume
    try:
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 590, in ensure
    def close(self):
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 531, in reconnect
    # to return an error not covered by its transport
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 513, in _connect
    Will retry up to self.max_retries number of times.
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 150, in reconnect
    use the callback passed during __init__()
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 508, in declare
    self.queue_bind(nowait)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 541, in queue_bind
    self.binding_arguments, nowait=nowait)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 551, in bind_to
    nowait=nowait)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/channel.py", line 1003, in queue_bind
    (50, 21),  # Channel.queue_bind_ok
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 68, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 86, in dispatch_method
    return amqp_method(self, args)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Queue.bind: (404) NOT_FOUND - no exchange 'reply_8f19344531b448c89d412ee97ff11e79' in vhost '/'

Then an RPC Timeout is raised every second in all the agents:

ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
TRACE neutron.agent.l3_agent Traceback (most recent call last):
TRACE neutron.agent.l3_agent   File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 702, in _rpc_loop
TRACE neutron.agent.l3_agent     self.context, router_ids)
TRACE neutron.agent.l3_agent   File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 79, in get_routers
TRACE neutron.agent.l3_agent     topic=self.topic)
TRACE neutron.agent.l3_agent   File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/proxy.py", line 130, in call
TRACE neutron.agent.l3_agent     exc.info, real_topic, msg.get('method'))
TRACE neutron.agent.l3_agent Timeout: Timeout while waiting on RPC response - topic: "q-l3-plugin", RPC method: "sync_routers" info: "<unknown>"

This actually makes the agents useless until they are all restarted.

An analysis of what's going on is coming soon :)


---------------------------

[Impact]

This patch addresses an issue seen when a RabbitMQ cluster node goes down:
OpenStack services try to reconnect to another RabbitMQ node and re-create
everything from scratch, and because the 'auto-delete' flag is set, a race
condition occurs between the re-creation and deletion of exchanges, queues
and bindings. This leaves nova-compute and the neutron agents reported as down.
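
For illustration only, here is a minimal sketch of that re-declare path (not
the actual oslo.messaging fix), using the same kombu/py-amqp stack as the
traceback above; the connection URL, retry count and the helper name
declare_reply_queue are made up for the example:

import time

from amqp.exceptions import NotFound   # the (404) NOT_FOUND channel error above
from kombu import Connection, Exchange, Queue

# Hypothetical fallback broker the agent reconnects to after a node dies.
conn = Connection('amqp://guest:guest@rabbit-node-2//')

def declare_reply_queue(connection, name):
    """Re-declare the auto-delete reply exchange, queue and binding."""
    channel = connection.channel()
    exchange = Exchange(name, type='direct', durable=False, auto_delete=True)
    queue = Queue(name, exchange=exchange, routing_key=name,
                  durable=False, auto_delete=True)
    # Queue.declare() declares the exchange, the queue and then the binding;
    # the binding is the step that failed with "no exchange ... in vhost '/'".
    queue(channel).declare()

for attempt in range(5):   # illustrative retry loop
    try:
        declare_reply_queue(conn, 'reply_8f19344531b448c89d412ee97ff11e79')
        break
    except NotFound:
        # The broker auto-deleted the exchange between the declare and the bind
        # (another connection's teardown removed its last queue). The 404 closes
        # the channel, so back off briefly and declare again on a fresh channel.
        time.sleep(1)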

[Test Case]

Note: the steps below are for trusty-icehouse, including the latest
oslo.messaging library (1.3.0-0ubuntu1.2 at the time of this writing).

Deploy an OpenStack cloud with multiple rabbit nodes and then abruptly
kill one of the rabbit nodes (e.g. sudo service rabbitmq-server stop,
etc.). Observe that the nova services and neutron agents do detect that
the node went down and report that they have reconnected, but messages
still time out, and nova service-list / neutron agent-list still report
the compute services and agents as down, etc.
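
To watch for the bad state from the API side, a small polling sketch
(assuming python-neutronclient is installed; the credentials, auth URL
and poll interval are placeholders):

import time

from neutronclient.v2_0 import client

# Placeholder credentials and endpoint; adjust for the deployed cloud.
neutron = client.Client(username='admin',
                        password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone:5000/v2.0')

while True:
    # Equivalent to "neutron agent-list": agents that stay alive=False even
    # after their logs claim a successful reconnect reproduce this bug.
    for agent in neutron.list_agents()['agents']:
        print('%-25s %-20s alive=%s' % (agent['agent_type'],
                                        agent['host'],
                                        agent['alive']))
    time.sleep(10)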

[Regression Potential]

None.

** Affects: neutron
     Importance: Undecided
         Status: Invalid

** Affects: oslo.messaging
     Importance: Medium
     Assignee: Dr. Jens Rosenboom (j-rosenboom-j)
         Status: Fix Released

** Affects: neutron (Ubuntu)
     Importance: Medium
         Status: New


** Tags: patch
-- 
RPC timeout in all neutron agents
https://bugs.launchpad.net/bugs/1318721