[Bug 1366004] Re: Sleep in RPC reconnect is affected by NTPd on neutron services' startup
James Page
james.page at ubuntu.com
Thu Sep 8 11:54:09 UTC 2016
** Changed in: neutron (Ubuntu)
Status: New => Triaged
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1366004
Title:
Sleep in RPC reconnect is affected by NTPd on neutron services'
startup
Status in neutron package in Ubuntu:
Triaged
Bug description:
Under certain situations, the neutron server and all its agents will
hang for a while on startup and become unresponsive.
The issue can be demonstrated by doing neutron agent-list shortly
after reboot:
$ neutron agent-list
Connection to neutron failed: Maximum attempts reached
or
$ neutron agent-list
+--------------------------------------+--------------------+-----------+-------+----------------+
| id | agent_type | host | alive | admin_state_up |
+--------------------------------------+--------------------+-----------+-------+----------------+
| 4306e16a-9786-4772-acbd-93c9911a6971 | Linux bridge agent | myhost | xxx | True |
| 9635d63a-1606-4baf-9093-e99f3eaf0552 | DHCP agent | myhost | xxx | True |
| 9cfb2966-5cc0-4528-9bf6-388703d0d203 | Metadata agent | myhost | xxx | True |
| e6bbaf50-adfb-47fd-9e22-e27624e16704 | L3 agent | myhost | xxx | True |
+--------------------------------------+--------------------+-----------+-------+----------------+
as opposed to them being reported alive.
Note that is its expected to see output (1) - server isn't started, or
output (2) - [some] agents don't report being alive, for a while, but
not too long, certainly not minutes.
The place where they hang seems to always be the RPC connections'
setup to the AMQP server, in
neutron.openstack.common.rpc.impl_kombu.Connection.reconnect, on the
time.sleep(sleep_time) - final statement of that method.
I strongly believe it only happens if a NTP time step happens while
the reconnection attempts are being made. The sleep_time variable is
set to 1 initially and grows by 2 seconds on each retry, for a maximum
of 30 retries.
However, it has been observed by stracing the process, that the
underlying select(2) call, using which time.sleep is implemented in
Python, seems to operate with a timeout parameter of more than 15
minutes on the affected machine, which approximately coincides with
the NTP stepping difference for that machine. Associated neutron
server/agent logs show a similar step of 15 minutes between messages
at some point. If an agent run happens to not be affected, the message
with the time skip is from elsewhere in the codebase.
I'd be seriously baffling if GNU C library's select(2) were truly
affected by setting the clock, but I have no other explanation at this
time.
A way to verify that the bad condition is happening is, for example on
the DHCP agent:
# get PID of the agent and verify it's running:
$ ps aux | grep neutron-dhcp-agent
neutron 2740 0.3 0.0 110228 34976 ? Ss 16:38 0:00 /usr/bin/python /usr/bin/neutron-dhcp-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini --log-file=/var/log/neutron/dhcp-agent.log
# check that it's waiting in select with a large seconds parameter.
$ sudo strace -p 2740
Process 2740 attached
select(0, NULL, NULL, NULL, {753, 918063}^CProcess 2740 detached
<detached ...>
The example above shows remaining timeout of almost 753.9 seconds. The
value reported is not the original value, it decreases as you strace
the process again.
A workaround to this issue would be forcing the neutron upstart jobs
to only run after ntp steps the clock. Since ntp is an sysv job and
not an upstart job, the easiest way to achieve this is waiting for all
sysv jobs to finish, which coincides with the 'rc' upstart job's
stopping.
Hence, the proposed workaround is to edit the neutron upstart jobs:
sudo sed -i -e '/start on/s/$/ and stopped rc/' /etc/init/neutron*conf
Description: Ubuntu 14.04.1 LTS
Release: 14.04
ii neutron-common 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - common
ii neutron-dhcp-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - DHCP agent
ii neutron-l3-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - l3 agent
ii neutron-metadata-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - metadata agent
ii neutron-plugin-linuxbridge 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - linuxbridge plugin
ii neutron-plugin-linuxbridge-agent 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - linuxbridge plugin agent
ii neutron-plugin-ml2 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - ML2 plugin
ii neutron-server 1:2014.1.2-0ubuntu1.1 all Neutron is a virtual network service for Openstack - server
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1366004/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list