[Bug 1318721] Re: RPC timeout in all neutron agents

Chris J Arges 1318721 at bugs.launchpad.net
Fri Jan 22 14:34:17 UTC 2016


Hello mouadino, or anyone else affected,

Accepted oslo.messaging into trusty-proposed. The package will build now
and be available at
https://launchpad.net/ubuntu/+source/oslo.messaging/1.3.0-0ubuntu1.4 in
a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us in getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, and change the tag
from verification-needed to verification-done. If it does not fix the
bug for you, please add a comment stating that, and change the tag to
verification-failed.  In either case, details of your testing will help
us make a better decision.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance!

** Changed in: oslo.messaging (Ubuntu Trusty)
       Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1318721

Title:
  RPC timeout in all neutron agents

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive icehouse series:
  In Progress
Status in Ubuntu Cloud Archive juno series:
  In Progress
Status in neutron:
  Invalid
Status in oslo.messaging:
  Fix Released
Status in neutron package in Ubuntu:
  Invalid
Status in oslo.messaging package in Ubuntu:
  Invalid
Status in neutron source package in Trusty:
  Fix Committed
Status in oslo.messaging source package in Trusty:
  Fix Committed

Bug description:
  In the logs, the first traceback that appears is this:

  [-] Unexpected exception occurred 1 time(s)... retrying.
  Traceback (most recent call last):
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
      return infunc(*args, **kwargs)
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 741, in _consumer_thread

    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 732, in consume
      @excutils.forever_retry_uncaught_exceptions
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 660, in iterconsume
      try:
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 590, in ensure
      def close(self):
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 531, in reconnect
      # to return an error not covered by its transport
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 513, in _connect
      Will retry up to self.max_retries number of times.
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 150, in reconnect
      use the callback passed during __init__()
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 508, in declare
      self.queue_bind(nowait)
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 541, in queue_bind
      self.binding_arguments, nowait=nowait)
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 551, in bind_to
      nowait=nowait)
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/channel.py", line 1003, in queue_bind
      (50, 21),  # Channel.queue_bind_ok
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 68, in wait
      return self.dispatch_method(method_sig, args, content)
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 86, in dispatch_method
      return amqp_method(self, args)
    File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
      reply_code, reply_text, (class_id, method_id), ChannelError,
  NotFound: Queue.bind: (404) NOT_FOUND - no exchange 'reply_8f19344531b448c89d412ee97ff11e79' in vhost '/'

  Then an RPC timeout is raised every second in all the agents:

  ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
  TRACE neutron.agent.l3_agent Traceback (most recent call last):
  TRACE neutron.agent.l3_agent   File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 702, in _rpc_loop
  TRACE neutron.agent.l3_agent     self.context, router_ids)
  TRACE neutron.agent.l3_agent   File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 79, in get_routers
  TRACE neutron.agent.l3_agent     topic=self.topic)
  TRACE neutron.agent.l3_agent   File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/proxy.py", line 130, in call
  TRACE neutron.agent.l3_agent     exc.info, real_topic, msg.get('method'))
  TRACE neutron.agent.l3_agent Timeout: Timeout while waiting on RPC response - topic: "q-l3-plugin", RPC method: "sync_routers" info: "<unknown>"

  This effectively makes the agents useless until they are all restarted.

  An analysis of what's going on is coming soon :)

  
  ---------------------------

  [Impact]

  This patch addresses an issue that occurs when a RabbitMQ cluster node
  goes down. OpenStack services try to reconnect to another RabbitMQ node
  and then re-create everything from scratch. Because the 'auto-delete'
  flag is set, a race condition occurs between the re-creation and the
  deletion of exchanges, queues and bindings, which causes nova-compute
  and the neutron agents to be reported as down.
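
  The race described above can be sketched with a toy in-memory model.
  This is only an illustration under stated assumptions, not RabbitMQ's
  or oslo.messaging's actual code: ToyBroker, reconnect and the exchange
  and queue names are hypothetical, and the interleaving shown is just
  one plausible ordering that ends in the 404 NOT_FOUND seen in the
  traceback.

```python
class NotFound(Exception):
    """Raised when a bind targets an exchange that no longer exists."""


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker with auto-delete exchanges."""

    def __init__(self):
        # exchange name -> set of queue names currently bound to it
        self.exchanges = {}

    def declare_exchange(self, name):
        # Declaring an exchange that already exists is a no-op.
        self.exchanges.setdefault(name, set())

    def bind_queue(self, queue, exchange):
        if exchange not in self.exchanges:
            raise NotFound(
                "Queue.bind: (404) NOT_FOUND - no exchange %r" % exchange)
        self.exchanges[exchange].add(queue)

    def delete_queue(self, queue):
        # Auto-delete: dropping the last binding also drops the exchange.
        for name in list(self.exchanges):
            bindings = self.exchanges[name]
            if queue in bindings:
                bindings.discard(queue)
                if not bindings:
                    del self.exchanges[name]


def reconnect(broker, exchange, new_queue, cleanup_old=None):
    """Re-declare the exchange, then bind a queue to it.

    cleanup_old models the broker deleting the old connection's queue
    in the window between the declare and the bind.
    """
    broker.declare_exchange(exchange)
    if cleanup_old is not None:
        cleanup_old()
    broker.bind_queue(new_queue, exchange)


def demo():
    broker = ToyBroker()
    broker.declare_exchange("reply_x")
    broker.bind_queue("old_q", "reply_x")
    try:
        reconnect(broker, "reply_x", "new_q",
                  cleanup_old=lambda: broker.delete_queue("old_q"))
    except NotFound as exc:
        return str(exc)
    return "bound"
```

  Calling demo() exercises the losing side of the race: the re-declare
  is a no-op because the exchange still exists, the late cleanup then
  removes it, and the bind that follows fails.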

  [Test Case]

  Note: these steps are for trusty-icehouse, including the latest
  oslo.messaging library (1.3.0-0ubuntu1.2 at the time of this writing).

  Deploy an OpenStack cloud with multiple RabbitMQ nodes and then
  abruptly stop one of them (e.g. sudo service rabbitmq-server stop).
  Observe that the nova services and neutron agents detect that the node
  went down and report that they have reconnected, but messages still
  time out, and nova service-list / neutron agent-list still report the
  compute services and agents as down.
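
  When verifying, it can help to count how often the RPC timeout still
  shows up in an agent log after the node is stopped. The helper below
  is a minimal sketch, assuming the log lines look like the Timeout line
  quoted in the bug description; count_rpc_timeouts is a hypothetical
  name, and other releases may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the Timeout line quoted in the bug description;
# this is an assumption about the log format, not a stable interface.
TIMEOUT_RE = re.compile(
    r'Timeout while waiting on RPC response - topic: "(?P<topic>[^"]+)", '
    r'RPC method: "(?P<method>[^"]+)"'
)


def count_rpc_timeouts(lines):
    """Count timeout lines per (topic, method) pair in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("topic"), match.group("method"))] += 1
    return counts
```

  Pass it any iterable of log lines, for example an open file object
  for the agent log. A count that keeps growing after the services
  report a successful reconnect suggests the bug is still present.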

  [Regression Potential]

  None.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1318721/+subscriptions


