[MERGE] timing sensitive test can cause spurious failures

Andrew Bennetts andrew at canonical.com
Thu Aug 31 03:42:39 BST 2006


John Arbash Meinel wrote:
> Andrew Bennetts wrote:
[...]
> > Timing-sensitive tests suck.  They will always break for someone somewhere,
> > because there's always a system that's faster or slower than you ever expected.
> > 
> > Also, it's probably a good habit to think of time.sleep as a low-resolution
> > hint, rather than something to be trusted.
> > 
> 
> Well the test under consideration should actually be taking ~1ms. It is
> something *really* simple. We've already bumped it from being <10ms to
> being <100ms. We discussed it, and we don't really want to go above
> that. Robert explicitly didn't want it to be too loose, and miss a real bug.
> 
> Your machine is taking 100 times too long to respond to a simple
> time.sleep() call. It might be better to use a loop, or something like
> that, which should take a little while, but maybe be less sensitive than
> time.sleep().

Sorry, I just don't think you can assume that a reliable amount of time will
pass between one python bytecode and the next, even if it's invoking time.sleep.

I wasn't paying close attention, but I suspect what happened is something like a
sudden burst of IO activity elsewhere on my laptop, causing other processes not
to be scheduled for >10ms.  A sudden swap storm, spam filtering, a cron job, an
annoying flash animation in a web browser, running two test suites
simultaneously -- all sorts of things could cause this.  In the extreme, I might
suspend my laptop and wake it up again an hour later, or NTP might have adjusted
my system clock...

What are we trying to test here?  It surely isn't that the nanosleep syscall
will return just after the requested 7ms within a minuscule tolerance -- in fact
the man page for it on Linux says that:

  a) it only guarantees to return after *at least* the time specified (unless a
     signal interrupts it).  *Not* that it will return within some tiny margin
     after the time specified.
  b) it has a resolution of 1/HZ -- on older linuxes, that'll be 10ms, on
     current 2.6 kernels on i386 that's 4ms, I think.  So even assuming a
     totally quiet system except for this sleeping process, time.sleep(0.007)
     could return 11ms later.
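
To make point (b) concrete, here's a quick standalone sketch (not from bzrlib,
and the numbers are only illustrative) that measures how long time.sleep(0.007)
actually blocks.  On a HZ=100 kernel, or on a busy machine, individual sleeps
routinely overshoot the requested 7ms:

    import time

    # Measure how long a nominal 7ms sleep really takes.  Expect values at or
    # above 7ms; on a loaded system or a low-HZ kernel they can be much larger.
    for _ in range(5):
        start = time.time()
        time.sleep(0.007)
        elapsed = time.time() - start
        print("requested 7.0ms, slept %.1fms" % (elapsed * 1000))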

Separately, there are platforms such as Windows where time.time() has a very low
resolution.  On Windows it's about 0.055s (18.2Hz).  Who knows what resolutions
other platforms have; at
http://mail.zope.org/pipermail/zodb-dev/2002-May/002702.html Tim Peters says
that the "C standard doesn't require any minimum level of resolution".
Incidentally, that was in a thread about a test failure in ZODB involving
time.time/time.sleep...
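
If you're curious, a rough probe like the following (again just an illustration,
not anything from bzrlib) will report the smallest step that time.time() can
actually resolve on your platform -- microseconds on most Unixes, but tens of
milliseconds on some Windows configurations:

    import time

    def smallest_time_step(samples=10):
        # Spin until time.time() changes and return the smallest step seen.
        steps = []
        for _ in range(samples):
            t0 = time.time()
            t1 = time.time()
            while t1 == t0:      # busy-wait until the clock ticks over
                t1 = time.time()
            steps.append(t1 - t0)
        return min(steps)

    print("smallest observed time.time() step: %.6fs" % smallest_time_step())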

Basically, almost any possible use of time.sleep or time.time in a test is a
problem waiting to happen.  This is especially true if it assumes that
time.sleep or time.time will have millisecond or better resolution.

If you want to test how some code behaves as time passes, parameterise the way it
reads the clock, so that you can control the code's knowledge of what the time
is.  In this case, change TestCase.time like this:

=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py    2006-08-30 06:54:56 +0000
+++ bzrlib/tests/__init__.py    2006-08-31 02:09:46 +0000
@@ -683,7 +683,7 @@
         """
         if self._benchtime is None:
             self._benchtime = 0
-        start = time.time()
+        start = self._get_time()
         try:
             if not self._gather_lsprof_in_benchmarks:
                 return callable(*args, **kwargs)
@@ -694,7 +694,10 @@
                 self._benchcalls.append(((callable, args, kwargs), stats))
                 return ret
         finally:
-            self._benchtime += time.time() - start
+            self._benchtime += self._get_time() - start
+
+    def _get_time(self):
+        return time.time()

     def _runCleanups(self):
         """Run registered cleanup functions.


Now you can override _get_time in the offending test to e.g. return
self._dummy_time, initially zero.  The method_that_times_a_bit_twice method that
the test uses would then increment self._dummy_time rather than calling
time.sleep.
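
For illustration only -- apart from TestCase.time(), _benchtime, _get_time and
the names already mentioned above, the class and helper names here are made up
-- the test could then be written along these lines:

    from bzrlib.tests import TestCase

    class TestBenchmarkTime(TestCase):

        def _get_time(self):
            # Read the fake clock instead of the real one.
            return self._dummy_time

        def _advance_dummy_time(self, amount):
            self._dummy_time += amount

        def method_that_times_a_bit_twice(self):
            # Instead of time.sleep(0.007), just advance the fake clock.
            self.time(self._advance_dummy_time, 0.007)
            self.time(self._advance_dummy_time, 0.007)

        def test_time_accumulates_benchmark_time(self):
            self._dummy_time = 0.0
            self.method_that_times_a_bit_twice()
            # The accumulated benchmark time is now fully deterministic.
            self.assertEqual(0.014, self._benchtime)

No real time passes, no scheduler or clock resolution is involved, and the
assertion can be exact.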

This has the pleasant side-effect of running a bit faster, although for this
test it's only saving (at least) 14ms.  I've seen tests with multi-second sleeps
before, and they really bog down a test suite for no good reason.

So, we don't want to test the behaviour of time.sleep or time.time; we leave
that to the people developing the underlying OS.  We do want to "Test that the
TestCase.time() method accumulates a benchmark time."  So let's test *that*
method, and leave the unpredictable platform behaviour out of it.  I would go so
far as to treat time.sleep/time.time calls in tests the same way as bare
excepts: assumed buggy unless there's a comment justifying why it's safe, and
even then stay a bit suspicious :)

Finally, zero tolerance for intermittently failing tests is a healthy attitude.
One or two of them fail rarely, and you barely notice... then one day there are
twenty different ones failing with varying frequency, half your test runs on
certain systems fail for no good reason, and you tear your
hair out trying to fix it all.  This happened to Twisted (in part because of
tests using time.time/time.sleep), and we're *still* trying to undo all the
damage years later.

This test can fail today.  It's already hit me.  What are we going to do to fix
it?

-Andrew.




