[Bug 1800957] Fix included in openstack/oslo.messaging 9.5.0
OpenStack Infra
1800957 at bugs.launchpad.net
Tue Feb 26 00:03:43 UTC 2019
This issue was fixed in the openstack/oslo.messaging 9.5.0 release.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to oslo.messaging in Ubuntu.
https://bugs.launchpad.net/bugs/1800957
Title:
Upgrading to pike version causes rabbit timeouts with ssl
Status in oslo.messaging:
Fix Released
Status in oslo.messaging package in Ubuntu:
New
Bug description:
We have discovered an issue when upgrading our clouds from Ocata to
Pike.
oslo.messaging versions
ocata: 5.17.1
pike: 5.30.0
python-amqp versions
ocata: 1.4.9
pike: 2.1.4
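To confirm which versions a given node is actually running, something like the following could be run in the same Python environment as the service. This is only a sketch: it assumes Python 3.8+ (for importlib.metadata, whereas the traceback below shows python2.7) and that the distributions are registered under the names "oslo.messaging" and "amqp". The helper name installed_versions is made up for this example.

```python
from importlib import metadata


def installed_versions(names=("oslo.messaging", "amqp")):
    """Return {distribution_name: version or None} for the given names."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print("%s: %s" % (name, version or "not installed"))
```

Comparing the output on an Ocata node and a Pike node should show the version pairs listed above.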
After upgrading to Pike we see several issues with neutron-dhcp-agent and
nova-compute.
The error we see is:
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent [req-79e8c605-055e-4354-b749-7dd7baabf864 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 740, in _report_state
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent ctx, self.agent_state, True)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 92, in report_state
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent return method(context, 'report_state', **kwargs)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent retry=self.retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent timeout=timeout, retry=retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent retry=retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent result = self._waiter.wait(msg_id, timeout)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent message = self.waiters.get(msg_id, timeout=timeout)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent 'to message ID %s' % msg_id)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3
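To see how many distinct RPC calls are timing out, the message IDs can be pulled out of the agent log. The snippet below is a minimal sketch that assumes the log lines have the same shape as the ones above; the helper name timed_out_message_ids is hypothetical.

```python
import re

# Matches the MessagingTimeout text exactly as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"MessagingTimeout: Timed out waiting for a reply to message ID ([0-9a-f]+)"
)


def timed_out_message_ids(log_text):
    """Return the distinct timed-out message IDs, in order of first appearance."""
    seen = []
    for match in TIMEOUT_RE.finditer(log_text):
        msg_id = match.group(1)
        if msg_id not in seen:
            seen.append(msg_id)
    return seen
```

A single ID repeated many times points at one stuck call, while a growing list of IDs means every report is timing out.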
Steps to reproduce:
Start neutron-dhcp-agent with no networks hosted on it.
Agent reporting is fine at this point: I stepped through it manually with pdb and triggered the agent report hundreds of times, every 1-2 seconds, and neutron-server always responded in about 1 second.
Now schedule a network onto the agent.
The agent sync now times out.
I can see the reply queue in RabbitMQ start to fill up with unacked
messages, and the agent starts to produce the stack trace above
consistently.
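One way to watch this happen is to list the queues with their unacknowledged counts and filter for reply queues. The sketch below parses text in the shape produced by `rabbitmqctl list_queues name messages_unacknowledged` (one queue per line, name and count separated by whitespace). It assumes the reply queues carry a "reply_" prefix. The function name and the sample queue names are made up for illustration.

```python
def unacked_reply_queues(listing, prefix="reply_"):
    """Return {queue_name: unacked_count} for reply queues with unacked messages."""
    result = {}
    for line in listing.splitlines():
        parts = line.split()
        # Skip banner or header lines, which do not end in an integer count.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        name, count = parts[0], int(parts[1])
        if name.startswith(prefix) and count > 0:
            result[name] = count
    return result


if __name__ == "__main__":
    # Hypothetical listing; real queue names and counts will differ.
    sample = "Listing queues\nreply_0123abcd\t17\nq-plugin\t0\nreply_89ef4567\t0\n"
    print(unacked_reply_queues(sample))
```

Running the listing every few seconds while scheduling a network onto the agent should show whether the count for the agent's reply queue keeps climbing.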
Removing the network and restarting the agent gets the agent reporting
normally again.
If I do the same thing but without using the RabbitMQ SSL port and
SSL settings, everything works flawlessly.
We also see this behaviour with nova-compute. Something happens and
then all messages get stuck in unack and timeouts appear in the log.
I suspect this could be more to do with the python-amqp version but I'm not certain.
We've tried with SSL handled in RabbitMQ itself, using versions 3.6.5 and 3.6.10. We've also tried putting an F5 load balancer in front to offload SSL, but to no avail.
To manage notifications about this bug go to:
https://bugs.launchpad.net/oslo.messaging/+bug/1800957/+subscriptions