[Bug 1518430] Re: liberty: ~busy loop on epoll_wait being called with zero timeout
John Eckersberg
jeckersb at redhat.com
Fri Oct 14 14:52:37 UTC 2016
Copy/paste from https://bugzilla.redhat.com/show_bug.cgi?id=1384183#c10:
In the oslo.messaging rabbit driver, there is a thread for heartbeats:
https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968
specifically, the three methods, _heartbeat_start, _heartbeat_stop,
and _heartbeat_thread_job.
In _heartbeat_start, it creates a threading.Event (this event ends up
being the root of the problem) and a threading.Thread to run the
heartbeat loop, and then starts that thread.
The heartbeat thread does work, and then at the end executes:
    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)
The _heartbeat_stop method sets the exit event when the connection is
stopped; the wait above is where the heartbeat thread picks that up and
shuts down.
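Condensed, the pattern looks roughly like this (my paraphrased sketch,
not the actual driver code; _send_heartbeat_and_check_connection is a
stand-in for the real per-iteration work):

    import threading

    class Connection(object):

        _heartbeat_wait_timeout = 15.0   # the default discussed below

        def _heartbeat_start(self):
            # the Event created here is what ends up causing the trouble
            self._heartbeat_exit_event = threading.Event()
            self._heartbeat_thread = threading.Thread(
                target=self._heartbeat_thread_job)
            self._heartbeat_thread.daemon = True
            self._heartbeat_thread.start()

        def _heartbeat_thread_job(self):
            while not self._heartbeat_exit_event.is_set():
                self._send_heartbeat_and_check_connection()
                # this is the wait that degenerates under eventlet
                self._heartbeat_exit_event.wait(
                    timeout=self._heartbeat_wait_timeout)

        def _heartbeat_stop(self):
            self._heartbeat_exit_event.set()
            self._heartbeat_thread.join()

        def _send_heartbeat_and_check_connection(self):
            pass   # stand-in for the real AMQP heartbeat work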
With the default heartbeat configuration, _heartbeat_wait_timeout ends
up being 15.0 seconds. This sounds reasonable, but explodes under the
hood.
Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of
code:
    # Balancing act: We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive. The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005 # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)
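To get a sense of scale, here is a quick back-of-the-envelope script
(mine, not from the driver) that replays that delay schedule for the
default 15 second timeout, assuming the waiter is never woken early:

    # Replays the delay doubling from Condition.wait() for timeout=15.0.
    delay, elapsed, delays = 0.0005, 0.0, []
    timeout = 15.0
    while elapsed < timeout:
        delay = min(delay * 2, timeout - elapsed, .05)
        delays.append(delay)
        elapsed += delay
    print(delays[:6])   # [0.001, 0.002, 0.004, 0.008, 0.016, 0.032]
    print(max(delays))  # 0.05 -- no single sleep is ever longer than 50 ms
    print(len(delays))  # ~300 wakeups over a single 15 s Event.wait()

So even a completely idle heartbeat wait keeps feeding the scheduler a
steady stream of sub-50 ms sleeps.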
Now, the problem is that _sleep here is time.sleep, which has been
previously monkey-patched by eventlet to be eventlet.greenthread.sleep
instead.
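You can see the substitution directly (this assumes eventlet is
importable; monkey_patch() with no arguments patches the time module
along with everything else):

    import eventlet
    eventlet.monkey_patch()

    import time
    from eventlet import greenthread

    # After monkey_patch(), the stdlib name points at eventlet's sleep, so
    # every _sleep(delay) inside Condition.wait() yields to the eventlet hub.
    print(time.sleep is greenthread.sleep)   # expected: True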
When you call sleep() in eventlet, it adds a timer onto an internal
heap that the hub maintains[2]. When the main event loop runs, it looks
at the shortest pending timer and uses that delta as the argument for
waiting (which ultimately becomes the timeout passed to epoll_wait).
Here's the crucial part of the code[3]:
    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)
Because Condition.wait() uses such small sleep times (between 0.0005
and 0.05 seconds), these tiny timers are almost always the next ones to
fire, and sleep_time above ends up being negative (or small enough to
be practically zero anyway). The self.wait() call propagates down to
epoll_wait() and you get the behavior we're seeing.
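Here is a minimal reproducer sketch (mine, assuming Python 2.7 and
eventlet installed; running it under something like
"strace -e epoll_wait" should show the near-zero timeouts for the full
15 seconds):

    import eventlet
    eventlet.monkey_patch()

    import threading

    exit_event = threading.Event()

    def heartbeat_wait():
        # mirrors the final statement of _heartbeat_thread_job
        exit_event.wait(timeout=15.0)

    # under monkey-patching this "thread" is really a greenthread on the hub
    t = threading.Thread(target=heartbeat_wait)
    t.start()
    t.join()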
So, the root of the problem is that using threading.Event inside of
eventlet is bad.
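One way to avoid the polling (a sketch of the general direction only,
not necessarily the fix that will land in oslo.messaging; Heartbeat and
_send_heartbeat are made-up names) is to park on an eventlet primitive
instead, so the hub schedules a single 15 second timer rather than
hundreds of tiny ones:

    import eventlet
    from eventlet import event
    from eventlet.timeout import Timeout

    class Heartbeat(object):   # illustration only

        def start(self):
            self._exit = event.Event()
            self._gt = eventlet.spawn(self._run)

        def _run(self):
            while not self._exit.ready():
                self._send_heartbeat()
                with Timeout(15, False):   # silent timeout: just wake up
                    self._exit.wait()      # parks on the hub, no tiny sleeps

        def stop(self):
            self._exit.send()
            self._gt.wait()

        def _send_heartbeat(self):
            pass   # stand-in for the real heartbeat work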
[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1518430
Title:
liberty: ~busy loop on epoll_wait being called with zero timeout
Status in OpenStack Compute (nova):
Invalid
Status in oslo.messaging:
In Progress
Status in nova package in Ubuntu:
Invalid
Status in python-oslo.messaging package in Ubuntu:
New
Bug description:
Context: OpenStack juju/maas deploy using the 1510 charms release
on trusty, with:
  openstack-origin: "cloud:trusty-liberty"
  source: "cloud:trusty-updates/liberty"
* Several OpenStack nova- and neutron- services, at least:
  nova-compute, neutron-server, nova-conductor,
  neutron-openvswitch-agent, neutron-vpn-agent
  are almost busy-looping on epoll_wait() calls, most frequently with a
  zero timeout.
- nova-compute (chosen because it is single-process) strace and ltrace captures:
  http://paste.ubuntu.com/13371248/ (ltrace, strace)
For comparison, this is how it looks on a kilo deploy:
- http://paste.ubuntu.com/13371635/
* 'top' sample from a nova-cloud-controller unit on
this completely idle stack:
http://paste.ubuntu.com/13371809/
FYI *not* seeing this behavior on keystone, glance, cinder,
ceilometer-api.
As this issue is present in several components, it likely comes
from a common library (oslo.concurrency?); FYI, the bug is filed
against nova itself as a starting point for debugging.
Note: The description in the following bug gives a good overview of
the issue and points to a possible fix for oslo.messaging:
https://bugs.launchpad.net/mos/+bug/1380220