[Bug 1366004] [NEW] Sleep in RPC reconnect is affected by NTPd on neutron services' startup

Fri Sep 5 13:01:34 UTC 2014

Public bug reported:

Under certain situations, the neutron server and all its agents will
hang for a while on startup and become unresponsive.

The issue can be demonstrated by doing neutron agent-list shortly after
reboot:

$ neutron agent-list
Connection to neutron failed: Maximum attempts reached

or

$ neutron agent-list
+--------------------------------------+--------------------+-----------+-------+----------------+
| id                                   | agent_type         | host      | alive | admin_state_up |
+--------------------------------------+--------------------+-----------+-------+----------------+
| 4306e16a-9786-4772-acbd-93c9911a6971 | Linux bridge agent | myhost | xxx   | True           |
| 9635d63a-1606-4baf-9093-e99f3eaf0552 | DHCP agent         | myhost | xxx   | True           |
| 9cfb2966-5cc0-4528-9bf6-388703d0d203 | Metadata agent     | myhost | xxx   | True           |
| e6bbaf50-adfb-47fd-9e22-e27624e16704 | L3 agent           | myhost | xxx   | True           |
+--------------------------------------+--------------------+-----------+-------+----------------+

as opposed to them being reported alive.

Note that is its expected to see output (1) - server isn't started, or
output (2) - [some] agents don't report being alive, for a while, but
not too long, certainly not minutes.

The place where they hang seems to always be the RPC connections' setup
to the AMQP server, in
neutron.openstack.common.rpc.impl_kombu.Connection.reconnect, on the
time.sleep(sleep_time) - final statement of that method.

I strongly believe it only happens if a NTP time step happens while the
reconnection attempts are being made. The sleep_time variable is set to
1 initially and grows by 2 seconds on each retry, for a maximum of 30
retries.

However, it has been observed by stracing the process, that the
underlying select(2) call, using which time.sleep is implemented in
Python, seems to operate with a timeout parameter of more than 15
minutes on the affected machine, which approximately coincides with the
NTP stepping difference for that machine. Associated neutron
server/agent logs show a similar step of 15 minutes between messages at
some point. If an agent run happens to not be affected, the message with
the time skip is from elsewhere in the codebase.

I'd be seriously baffling if GNU C library's select(2) were truly
affected by setting the clock, but I have no other explanation at this
time.

A way to verify that the bad condition is happening is, for example on
the DHCP agent:

# get PID of the agent and verify it's running:
$ ps aux | grep neutron-dhcp-agent
neutron   2740  0.3  0.0 110228 34976 ?        Ss   16:38   0:00 /usr/bin/python /usr/bin/neutron-dhcp-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini --log-file=/var/log/neutron/dhcp-agent.log

# check that it's waiting in select with a large seconds parameter.
$ sudo strace -p 2740
Process 2740 attached
select(0, NULL, NULL, NULL, {753, 918063}^CProcess 2740 detached
 <detached ...>

The example above shows remaining timeout of almost 753.9 seconds. The
value reported is not the original value, it decreases as you strace the
process again.

A workaround to this issue would be forcing the neutron upstart jobs to
only run after ntp steps the clock. Since ntp is an sysv job and not an
upstart job, the easiest way to achieve this is waiting for all sysv
jobs to finish, which coincides with the 'rc' upstart job's stopping.

Hence, the proposed workaround is to edit the neutron upstart jobs:

sudo sed -i -e '/start on/s/$/ and stopped rc/' /etc/init/neutron*conf

Description:	Ubuntu 14.04.1 LTS
Release:	14.04

ii  neutron-common                            1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - common
ii  neutron-dhcp-agent                        1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - DHCP agent
ii  neutron-l3-agent                          1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - l3 agent
ii  neutron-metadata-agent                    1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - metadata agent
ii  neutron-plugin-linuxbridge                1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - linuxbridge plugin
ii  neutron-plugin-linuxbridge-agent          1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - linuxbridge plugin agent
ii  neutron-plugin-ml2                        1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - ML2 plugin
ii  neutron-server                            1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - server

** Affects: neutron (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1366004

Title:
  Sleep in RPC reconnect is affected by NTPd on neutron services'
  startup

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1366004/+subscriptions