[Bug 1800957] Re: Upgrading to pike version causes rabbit timeouts with ssl

Launchpad Bug Tracker 1800957 at bugs.launchpad.net
Wed Oct 30 11:32:19 UTC 2019


Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: oslo.messaging (Ubuntu)
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to oslo.messaging in Ubuntu.
https://bugs.launchpad.net/bugs/1800957

Title:
  Upgrading to pike version causes rabbit timeouts with ssl

Status in oslo.messaging:
  Fix Released
Status in oslo.messaging package in Ubuntu:
  Confirmed

Bug description:
  We have discovered an issue when upgrading our clouds from ocata to
  pike.

  oslo.messaging versions
  ocata: 5.17.1
  pike:  5.30.0

  python-amqp versions
  ocata: 1.4.9
  pike:  2.1.4

  On upgrading to pike we get several issues with neutron-dhcp-agent and
  nova-compute.

  The error we see is:

  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent [req-79e8c605-055e-4354-b749-7dd7baabf864 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 740, in _report_state
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     ctx, self.agent_state, True)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 92, in report_state
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     retry=self.retry)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     timeout=timeout, retry=retry)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     retry=retry)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     result = self._waiter.wait(msg_id, timeout)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
  2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3
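
  To see how many distinct RPC calls are timing out (rather than counting
  repeated traceback lines), the message IDs can be pulled out of the agent
  log. This is only a rough sketch: the function name is hypothetical, and
  the pattern assumes the exact "Timed out waiting for a reply to message
  ID ..." wording shown in the traceback above.

```python
import re

# Matches the MessagingTimeout text as it appears in the traceback above.
# The 32-hex-digit ID format is taken from that one sample (an assumption).
TIMEOUT_RE = re.compile(
    r"MessagingTimeout: Timed out waiting for a reply "
    r"to message ID ([0-9a-f]{32})"
)


def timed_out_message_ids(log_lines):
    """Return distinct timed-out message IDs, in first-seen order."""
    seen = []
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

  Passing the lines of the dhcp-agent log file to timed_out_message_ids()
  gives one entry per failed call, since each traceback mentions its ID
  more than once.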

  Steps to reproduce are:

  Start neutron-dhcp-agent with no networks hosted on it.
  Agent reporting is fine. Stepping through with pdb, I triggered the
  agent report hundreds of times, once every 1-2 seconds, and
  neutron-server always responded in ~1 second.

  Now schedule a network onto the agent.
  Now the agent sync times out.

  I can see the reply queue in rabbit and it starts to fill up with
  unacked messages and the agent starts to produce the stack trace above
  consistently.
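
  The reply-queue backlog described above can be checked from the broker
  side. The following is a minimal sketch, not a definitive tool: it parses
  the tab-separated output of
  "rabbitmqctl list_queues name messages_unacknowledged" and reports queues
  holding unacked messages. The function name is hypothetical, and the
  "reply_" queue-name prefix is an assumption about how oslo.messaging names
  its reply queues; adjust it to match what the broker actually shows.

```python
def stuck_reply_queues(list_queues_output, prefix="reply_", threshold=1):
    """Map queue name -> unacked count for queues that look stuck.

    `list_queues_output` is the text printed by
    `rabbitmqctl list_queues name messages_unacknowledged`.
    `prefix` is the assumed reply-queue naming prefix.
    """
    stuck = {}
    for line in list_queues_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip banner lines such as "Listing queues ..."
        name, count = parts[0], parts[1].strip()
        if not count.isdigit():
            continue  # not a queue row
        if name.startswith(prefix) and int(count) >= threshold:
            stuck[name] = int(count)
    return stuck
```

  Running this against output captured before and after scheduling a
  network onto the agent should show whether the unacked count only grows
  once the agent has a network to serve.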

  Removing the network and restarting the agent gets the agent reporting
  normally again.

  If I repeat the same steps without the rabbit SSL port and SSL setting,
  everything works flawlessly.

  We also see this behaviour with nova-compute. Something happens and
  then all messages get stuck in unack and timeouts appear in the log.

  I suspect this has more to do with the python-amqp version, but I'm not certain.
  We've tried terminating SSL in rabbitmq itself, using versions 3.6.5 and 3.6.10, and we've also tried putting an F5 LB in front to offload SSL, but to no avail.

To manage notifications about this bug go to:
https://bugs.launchpad.net/oslo.messaging/+bug/1800957/+subscriptions


