selftest performance: landing our code faster and developing quicker.

Vincent Ladeuil v.ladeuil+lp at free.fr
Thu Aug 27 10:18:53 BST 2009


>>>>> "robert" == Robert Collins <robert.collins at canonical.com> writes:

<snip/>

    robert> vila can run it in 4 minutes,

I protest! I have a patch to make it run in 3 minutes, but you
rejected it for reasons too long to repeat here!

:-D

The test farm essentially runs on a single quad-core machine,
with one slave (jaunty) being native (seeing the quad-core as 8
procs) and the two others (hardy and karmic) being virtual
machines (each configured with 4 procs).

As can be seen on the test farm waterfall view, running the full
test suite (without plugins) twice (with and without the locale
set) takes ~26 minutes on these three slaves concurrently.

That's running the test suite 6 times, so yes, 4 minutes is a
good approximation, but I suspect that running with '-v' instead
of '--subunit' and injecting that into Hudson may reduce the
overhead and the IO involved (the slaves report to the master in
real time).

<snip/>

    robert> We have a number of things contributing to this:
    robert>  - some of our test parameterisation runs many permutations we expect to
    robert>    fail. we could fairly cheaply just not create those.
    robert>  - we have some tests that need to do expensive things
    robert>  - some of our test support code is slower than it needs to be.
    robert>  - We run our core code several hundred thousand times, and our UI code 
    robert>    only once or twice for some tests.
    robert>  - IO is slow.

I agree with that.

But another way to look at it is to say that too many tests are
too high level.

We may discuss whether holes in integration tests are worse than
holes in unit tests, but I still think that holes in unit tests
are harder to identify and diagnose and take more time to fix,
so I'd rather have fewer of them.

<snip/>

    robert> Is this a problem, you may be asking? I think it is,
    robert> because of what it affects - and the heart of it is
    robert> cycle time: the time to make a change and be
    robert> confident its correct.

+1e6. That was the driver for --starting-with.

For people who don't know the option, it allows loading only a
part of the test suite and running only the loaded tests. Once
you are a bit familiar with the area you are testing, you can
work like this:

   ./bzr selftest http --list | wc -l
    789

Take a look at the list and choose the one most relevant to your
actual modification:

   ./bzr selftest -s bzrlib.tests.test_http.TestAuth.test_empty_pass
   Ran 12 tests in 0.371s

Then progressively go up:

  ./bzr selftest -s bzrlib.tests.test_http.TestAuth
  Ran 125 tests in 3.782s

  ./bzr selftest -s bzrlib.tests.test_http.
  Ran 629 tests in 19.032s

629 is still not the initial 789, but you get the idea: you can
progressively widen the scope of the tests.

<snip/>

    robert> If we could get down to (say) 1ms each, we'd still be
    robert> looking at 22 seconds: but that would be tolerable,
    robert> for a full test run. We're currently at 140ms (more
    robert> or less). So 1ms is a very high bar to set.  However,
    robert> the median is 74ms, so actually we're half way there
    robert> for half of our tests :).
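A quick back-of-the-envelope check of those figures (the test
count of ~22,000 is my inference from "1 ms each -> 22 seconds",
not an exact number):

```python
# Rough check of the quoted figures; the test count is inferred
# from "1 ms each -> 22 seconds" and is an approximation.
n_tests = 22_000
target_s = n_tests * 1 / 1000     # at the 1 ms/test goal
current_s = n_tests * 140 / 1000  # at the quoted ~140 ms/test

print(target_s)                    # 22.0 seconds
print(round(current_s / 60, 1))    # 51.3 minutes, single-threaded
```

So a full single-threaded run is on the order of 51 minutes of
CPU time today, which is why the per-test cost matters so much.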

    robert> I've thought of some things we can do - I'm sure there are more:

    robert>  - split out tests that we really only want to run
    robert>  rarely. E.g.
    robert>  bzrlib.tests.test_setup.TestSetup.test_build_and_install
    robert>  might be appropriate to run less often. OTOH its
    robert>  very nice to know that bzr doesn't build on a given
    robert>  platform.

+1, there are some suites that don't need to be run at each
commit but will find their place nicely on the test farm.

    robert>  - run only tests that execute changed code. This
    robert>  requires a cache call graph that needs to be pretty
    robert>  fast to answer this question. I have a sketch on my
    robert>  laptop from some months back. I hope to finish this
    robert>  after 2.0

I've thought about that for a long time (I mentioned it at our
last London sprint) without finding the time to work on it. I
firmly believe that from:

- the list of modules mentioned by --coverage when a single test
  is run, established for each revision,

- the list of modules changed since the last revision,

we should get a test list that ought to represent less than 10%
of the tests in more than 90% of the cases.
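A minimal sketch of the selection I have in mind (all names here
are hypothetical; bzr would need to record a per-test coverage
map itself, e.g. via --coverage, for this to work):

```python
# Sketch of coverage-based test selection. The data structures are
# hypothetical: bzr doesn't record per-test coverage today.

def select_tests(coverage_map, changed_modules):
    """Return the test ids whose recorded coverage touches a changed module.

    coverage_map: {test_id: set of modules that test executed},
                  established at the previous revision.
    changed_modules: modules modified since that revision.
    """
    changed = set(changed_modules)
    return sorted(test_id for test_id, modules in coverage_map.items()
                  if modules & changed)

# Toy example: only the http test touches the changed module.
coverage_map = {
    'test_http.TestAuth.test_empty_pass':
        {'bzrlib.transport.http', 'bzrlib.errors'},
    'test_dirstate.TestPack.test_pack':
        {'bzrlib.dirstate'},
}
print(select_tests(coverage_map, ['bzrlib.transport.http']))
# -> ['test_http.TestAuth.test_empty_pass']
```

The expensive part is keeping coverage_map fresh across
revisions, but even a stale map only costs us running a few
extra tests, never skipping a relevant one for unchanged code.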

<snip/>

    robert>  - profile your tests. Run bzr selftest -v <test_id>  and
    robert>    see if your test is taking a reasonable amount of time. if its more
    robert>    than 70ms, its higher than the median - and well worth spending a 
    robert>    little time considering how to bring it down. One good way to see
    robert>    what a test is doing is selftest --lsprof-tests <test_id>.

Or even simply selftest -v: if used with a sufficiently focused
'-s', the output remains small and can be read quickly.

     Vincent



More information about the bazaar mailing list