[Bug 1518430] Re: liberty: ~busy loop on epoll_wait being called with zero timeout

John Eckersberg jeckersb at redhat.com
Fri Oct 14 14:52:37 UTC 2016


Copy/paste from https://bugzilla.redhat.com/show_bug.cgi?id=1384183#c10:

In the oslo.messaging rabbit driver, there is a thread for heartbeats:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968

specifically, the three methods, _heartbeat_start, _heartbeat_stop,
and _heartbeat_thread_job.

In _heartbeat_start, it creates a threading.Event (this event ends up
being the root of the problem), a threading.Thread for the heartbeat
thread, and then starts the heartbeat thread.
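
Paraphrased (a simplified sketch from memory, not a verbatim copy; see
the link above for the exact upstream source), the setup looks roughly
like this:

    def _heartbeat_start(self):
        if self._heartbeat_supported_and_enabled():
            self._heartbeat_exit_event = threading.Event()
            self._heartbeat_thread = threading.Thread(
                target=self._heartbeat_thread_job)
            self._heartbeat_thread.daemon = True
            self._heartbeat_thread.start()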

The heartbeat thread does work, and then at the end executes:

    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)

The _heartbeat_stop method sets the exit event when the connection is
stopped; this wait() is where the heartbeat thread picks that up and
shuts down.
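
Roughly (again a simplified sketch rather than the exact upstream
code), the thread body and the stop path look like:

    def _heartbeat_thread_job(self):
        while not self._heartbeat_exit_event.is_set():
            # ... send/check AMQP heartbeats on the connection ...
            self._heartbeat_exit_event.wait(
                timeout=self._heartbeat_wait_timeout)

    def _heartbeat_stop(self):
        if self._heartbeat_thread is not None:
            self._heartbeat_exit_event.set()
            self._heartbeat_thread.join()
            self._heartbeat_thread = None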

With the default heartbeat configuration, _heartbeat_wait_timeout ends
up being 15.0 seconds.  This sounds reasonable, but explodes under the
hood.
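
(Assuming the defaults heartbeat_timeout_threshold=60 and
heartbeat_rate=2, the driver derives the value along these lines;
treat the exact formula as my recollection rather than gospel:)

    heartbeat_timeout_threshold = 60.0   # assumed default
    heartbeat_rate = 2.0                 # assumed default
    heartbeat_wait_timeout = heartbeat_timeout_threshold / heartbeat_rate / 2.0
    # 60 / 2 / 2 = 15.0 seconds between heartbeat thread wake-ups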

Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of
code:

    # Balancing act:  We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive.  The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005 # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)
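
To get a feel for how many wake-ups that schedule produces, here is a
small stand-alone simulation of the delay sequence for a 15 second
timeout (illustrative only; it just replays the arithmetic above):

    # Replay the Condition.wait() delay schedule for timeout=15.0
    timeout = 15.0
    elapsed = 0.0
    delay = 0.0005
    wakeups = 0
    while elapsed < timeout:
        delay = min(delay * 2, timeout - elapsed, 0.05)
        elapsed += delay
        wakeups += 1
    print(wakeups)  # roughly 300 wake-ups, i.e. ~20/second once delay caps at 50 ms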

Now, the problem is that _sleep here is time.sleep, which has been
previously monkey-patched by eventlet to be eventlet.greenthread.sleep
instead.
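
You can see the patching directly (assuming eventlet is installed; the
check below reflects how I understand eventlet's patcher to work):

    import eventlet
    eventlet.monkey_patch()

    import time
    from eventlet import greenthread
    print(time.sleep is greenthread.sleep)  # expected: True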

When you call sleep() in eventlet, it adds a timer onto an internal
heap that it maintains[2].  When the main event loop runs, it checks
the soonest timer and uses that delta as the timeout when it waits for
I/O (this ultimately ends up as the timeout passed to epoll_wait).
Here's the crucial part of the code[3]:

    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)

Because Condition.wait() uses such small sleep times (between 0.0005
and 0.05 seconds), these tiny timers are almost always the next timer
to fire, and sleep_time above ends up being negative (or small enough
to be practically zero).  The self.wait() call propagates down to
epoll_wait() and you get the behavior we're seeing.
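
To make the sign of sleep_time concrete, here is the hub's arithmetic
with made-up (hypothetical) clock values when the next timer is one of
those tiny Condition.wait() delays:

    wakeup_when = 100.0 + 0.0005   # timer added by _sleep(0.0005) at clock 100.0
    clock_now = 100.0009           # hub gets back to its loop ~0.9 ms later
    sleep_time = wakeup_when - clock_now
    print(sleep_time)              # negative, so the hub calls self.wait(0),
                                   # i.e. epoll_wait() with a zero timeout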

So, the root of the problem is that using threading.Event inside of
eventlet is bad.
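
A rough way to observe this from Python, sketched under a couple of
assumptions (Python 2.7, where Condition.wait() still has the polling
loop quoted above and threading keeps its module-level _sleep alias;
eventlet installed):

    import eventlet
    eventlet.monkey_patch()

    import threading

    calls = [0]
    green_sleep = threading._sleep  # eventlet's sleep after monkey patching

    def counting_sleep(t):
        calls[0] += 1
        green_sleep(t)

    threading._sleep = counting_sleep  # what Condition.wait() calls internally

    exit_event = threading.Event()

    def fake_heartbeat_thread():
        # mimics the heartbeat thread's idle wait
        exit_event.wait(timeout=15.0)

    t = threading.Thread(target=fake_heartbeat_thread)
    t.start()
    t.join()
    print(calls[0])  # ~300 tiny green sleeps in 15 s, each scheduling a hub
                     # timer that drives a near-zero epoll_wait() timeout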

[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1518430

Title:
  liberty: ~busy loop on epoll_wait being called with zero timeout

Status in OpenStack Compute (nova):
  Invalid
Status in oslo.messaging:
  In Progress
Status in nova package in Ubuntu:
  Invalid
Status in python-oslo.messaging package in Ubuntu:
  New

Bug description:
  Context: OpenStack juju/maas deploy using the 1510 charms release
  on trusty, with:
    openstack-origin: "cloud:trusty-liberty"
    source: "cloud:trusty-updates/liberty"

  * Several OpenStack nova- and neutron- services, at least
  nova-compute, neutron-server, nova-conductor,
  neutron-openvswitch-agent and neutron-vpn-agent, show near-busy
  looping on epoll_wait() calls, most frequently with a zero timeout.
  - nova-compute (chosen because it is single-process) strace and
    ltrace captures:
    http://paste.ubuntu.com/13371248/ (ltrace, strace)

  As comparison, this is how it looks on a kilo deploy:
  - http://paste.ubuntu.com/13371635/

  * 'top' sample from a nova-cloud-controller unit in
     this completely idle stack:
    http://paste.ubuntu.com/13371809/

  FYI *not* seeing this behavior on keystone, glance, cinder,
  ceilometer-api.

  As this issue is present in several components, it likely comes
  from a common library (oslo.concurrency?); FYI, the bug is filed
  against nova itself as a starting point for debugging.

  Note: The description in the following bug gives a good overview of
  the issue and points to a possible fix for oslo.messaging:
  https://bugs.launchpad.net/mos/+bug/1380220

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1518430/+subscriptions


