[Bug 1366004] Re: Sleep in RPC reconnect is affected by NTPd on neutron services' startup

Thu Sep 8 11:54:09 UTC 2016

** Changed in: neutron (Ubuntu)
       Status: New => Triaged

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1366004

Title:
  Sleep in RPC reconnect is affected by NTPd on neutron services'
  startup

Status in neutron package in Ubuntu:
  Triaged

Bug description:
  Under certain situations, the neutron server and all its agents will
  hang for a while on startup and become unresponsive.

  The issue can be demonstrated by doing neutron agent-list shortly
  after reboot:

  $ neutron agent-list
  Connection to neutron failed: Maximum attempts reached

  or

  $ neutron agent-list
  +--------------------------------------+--------------------+-----------+-------+----------------+
  | id                                   | agent_type         | host      | alive | admin_state_up |
  +--------------------------------------+--------------------+-----------+-------+----------------+
  | 4306e16a-9786-4772-acbd-93c9911a6971 | Linux bridge agent | myhost | xxx   | True           |
  | 9635d63a-1606-4baf-9093-e99f3eaf0552 | DHCP agent         | myhost | xxx   | True           |
  | 9cfb2966-5cc0-4528-9bf6-388703d0d203 | Metadata agent     | myhost | xxx   | True           |
  | e6bbaf50-adfb-47fd-9e22-e27624e16704 | L3 agent           | myhost | xxx   | True           |
  +--------------------------------------+--------------------+-----------+-------+----------------+

  as opposed to them being reported alive.

  Note that is its expected to see output (1) - server isn't started, or
  output (2) - [some] agents don't report being alive, for a while, but
  not too long, certainly not minutes.

  The place where they hang seems to always be the RPC connections'
  setup to the AMQP server, in
  neutron.openstack.common.rpc.impl_kombu.Connection.reconnect, on the
  time.sleep(sleep_time) - final statement of that method.

  I strongly believe it only happens if a NTP time step happens while
  the reconnection attempts are being made. The sleep_time variable is
  set to 1 initially and grows by 2 seconds on each retry, for a maximum
  of 30 retries.

  However, it has been observed by stracing the process, that the
  underlying select(2) call, using which time.sleep is implemented in
  Python, seems to operate with a timeout parameter of more than 15
  minutes on the affected machine, which approximately coincides with
  the NTP stepping difference for that machine. Associated neutron
  server/agent logs show a similar step of 15 minutes between messages
  at some point. If an agent run happens to not be affected, the message
  with the time skip is from elsewhere in the codebase.

  I'd be seriously baffling if GNU C library's select(2) were truly
  affected by setting the clock, but I have no other explanation at this
  time.

  A way to verify that the bad condition is happening is, for example on
  the DHCP agent:

  # get PID of the agent and verify it's running:
  $ ps aux | grep neutron-dhcp-agent
  neutron   2740  0.3  0.0 110228 34976 ?        Ss   16:38   0:00 /usr/bin/python /usr/bin/neutron-dhcp-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini --log-file=/var/log/neutron/dhcp-agent.log

  # check that it's waiting in select with a large seconds parameter.
  $ sudo strace -p 2740
  Process 2740 attached
  select(0, NULL, NULL, NULL, {753, 918063}^CProcess 2740 detached
   <detached ...>

  The example above shows remaining timeout of almost 753.9 seconds. The
  value reported is not the original value, it decreases as you strace
  the process again.

  A workaround to this issue would be forcing the neutron upstart jobs
  to only run after ntp steps the clock. Since ntp is an sysv job and
  not an upstart job, the easiest way to achieve this is waiting for all
  sysv jobs to finish, which coincides with the 'rc' upstart job's
  stopping.

  Hence, the proposed workaround is to edit the neutron upstart jobs:

  sudo sed -i -e '/start on/s/$/ and stopped rc/' /etc/init/neutron*conf

  Description:	Ubuntu 14.04.1 LTS
  Release:	14.04

  ii  neutron-common                            1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - common
  ii  neutron-dhcp-agent                        1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - DHCP agent
  ii  neutron-l3-agent                          1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - l3 agent
  ii  neutron-metadata-agent                    1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - metadata agent
  ii  neutron-plugin-linuxbridge                1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - linuxbridge plugin
  ii  neutron-plugin-linuxbridge-agent          1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - linuxbridge plugin agent
  ii  neutron-plugin-ml2                        1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - ML2 plugin
  ii  neutron-server                            1:2014.1.2-0ubuntu1.1                               all          Neutron is a virtual network service for Openstack - server

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1366004/+subscriptions