selftest performance: landing our code faster and developing quicker.

Martin Pool mbp at canonical.com
Fri Aug 28 02:15:56 BST 2009


Thanks for pushing this down; the speed of the tests makes a big
difference to the degree we use and benefit from them.

> I've thought of some things we can do - I'm sure there are more:
>  - split out tests that we really only want to run rarely. E.g.
>   bzrlib.tests.test_setup.TestSetup.test_build_and_install might be
>   appropriate to run less often. OTOH it's very nice to know when bzr
>   doesn't build on a given platform.

We should probably make sure that if any compiled extensions are
missing, the test run gives a clear warning; it's ok to test without
them, but you should know.

I think the test run overall is giving a bit too much noise: it should
be basically red/green, and the output about missing dependencies
should be so concise that it's blank for most developers most of the
time, and therefore noticeable when it's not.
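
To illustrate the kind of thing I mean, here is a rough sketch only --
the extension names below are just examples, not an actual inventory,
and selftest would of course discover them itself:

    import sys

    # Map from optional compiled extension name to the imported module,
    # or None if the import failed.  (Names here are illustrative.)
    compiled_extensions = {
        '_dirstate_helpers_pyx': None,
        '_groupcompress_pyx': object(),
    }

    missing = sorted(name for name, mod in compiled_extensions.items()
                     if mod is None)
    if missing:
        sys.stderr.write('WARNING: compiled extensions not available, '
                         'testing pure-Python fallbacks: %s\n'
                         % ', '.join(missing))

One line when something is wrong, and nothing at all when everything is
present.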

I guess it comes down to a question of which tests we want to defer or
remove, and when.  test_build_and_install may catch some problems.

It would be interesting (though maybe impractical) to accumulate data
on all bzr test runs in the world and see which tests ever actually fail.

>  - run only tests that execute changed code. This requires a cached call
>   graph that is fast enough to answer that question. I have
>   a sketch on my laptop from some months back. I hope to finish this
>   after 2.0
>  - think hard about what you want to test when you write a test. If you
>   want to test how an error looks: test that specifically, not how
>   it's raised or when it's raised. (E.g. add a test to test_errors, not
>   to bzrlib.blackbox.test_CMD).
>  - profile your tests. Run bzr selftest -v <test_id>  and
>   see if your test is taking a reasonable amount of time. If it's more
>   than 70ms, it's higher than the median - and well worth spending a
>   little time considering how to bring it down. One good way to see
>   what a test is doing is selftest --lsprof-tests <test_id>.

>  - use test doubles. I don't mean 'Mocks' or 'Stubs' specifically (and
>   they are different :P). I mean though, that you can make use of such
>   tools and similar things (like MemoryTree and MemoryTransport) to
>   reduce the amount of unrelated work done by your test. TreeTransform,
>   today, doesn't count as a test double, because it performs IO. This
>   is well worth fixing though, because, as Aaron has commented,
>   MemoryTree is a small subset of WorkingTree and may not be
>   trustworthy as a test double. (This doesn't mean that it isn't, just
>   that it's not known-great).

This is kind of a large topic, but I think the key point is that for
the fast method to be used universally, it must be both realistic *and*
easier than writing the test the simple way.

It's no good having an infinitely fast implementation for testing if
it's so different to the regular code that people avoid using it.  To
the person writing just one more test, the tradeoff is in favor of
ease of authoring, not runtime speed.  And in fact tests are already
not so easy to write as we would like.

If we could, for example, let people write blackbox tests in something
that looks like shell doctest, but that's actually abstracted to be
much faster than running from the whole command line down, that would
be very cool.
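
Purely as a sketch of the syntax -- nothing like this exists today, and
the output lines are made up -- I'm imagining something along the lines
of:

    $ bzr init trunk
    $ cd trunk
    $ echo hello > hello.txt
    $ bzr add hello.txt
    adding hello.txt
    $ bzr commit -m "add hello"

with the test framework free to satisfy each line through the command
objects and a MemoryTransport rather than a real subprocess and a real
disk.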

In general if people can express intent like "I need a real disk
directory" or "I need some kind of transport" or "I need a branch
history with merges" then we can reuse that across tests and make the
implementation of it fast.
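
For "a branch history with merges" the branch builder is already most
of the way there; something like this (I'm writing the API from memory,
so the details may be slightly off, and the revision ids are just
examples):

    from bzrlib import tests


    class TestGraphOperations(tests.TestCaseWithMemoryTransport):

        def make_history_with_merge(self):
            # The intent ("a small history with one merge") is stated in
            # one place; how cheaply it gets built is then the framework's
            # problem, not each individual test's.
            builder = self.make_branch_builder('source')
            builder.build_snapshot('A', None,
                [('add', ('', 'root-id', 'directory', None))])
            builder.build_snapshot('B', ['A'], [])
            builder.build_snapshot('C', ['A'], [])
            builder.build_snapshot('D', ['B', 'C'], [])
            return builder.get_branch()

        def test_merge_has_two_parents(self):
            branch = self.make_history_with_merge()
            rev = branch.repository.get_revision('D')
            self.assertEqual(['B', 'C'], rev.parent_ids)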

We find a fair number of bugs through failures in tests that only
accidentally exercise the place where the bug occurs.  Given that
exhaustive path coverage is impossible, this is actually something to
be welcomed.

Python gives fairly weak assurance that interfaces actually match up,
so I think it's relatively more important that we do test things
integrated together rather than in isolation.  For example in the case
of exception formatting: it's fine to construct an exception and check
its string repr.  But that still leaves many cases where the exception
_as actually raised_ doesn't look right: I fixed one the other day and
then had to fix it _again_ on pqm because the underlying class is
different in Python 2.4.
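
Concretely, I mean that both of these have a place (the particular
error and command below are only for illustration):

    from bzrlib import errors, tests


    class TestNotBranchErrorString(tests.TestCase):

        def test_str(self):
            # Cheap unit-level check: pins down the message formatting.
            err = errors.NotBranchError('/no/such/place')
            self.assertContainsRe(str(err), 'Not a branch')


    class TestNotBranchErrorReported(tests.TestCaseWithTransport):

        def test_reported_by_branch_command(self):
            # Slower blackbox-level check: the error as actually raised
            # and reported through the UI, which is where the Python 2.4
            # surprise above would have been caught.
            self.run_bzr_error(['Not a branch'],
                               'branch does-not-exist target')

The first alone would not have caught my bug; the second alone pins
down too little about the message itself.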

The other important thing here is to design for testability by making
classes or functions with few dependencies, or making things that are
relatively close to being pure functions.  spiv had one of these
yesterday: by making the batching code a class that says "I take
some requests and batch them up", you can test that without needing to
construct the complex repository-history situation where batching
becomes poor.
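
A stripped-down sketch of that shape (the names are invented; this is
not spiv's actual code):

    from bzrlib import tests


    class RequestBatcher(object):
        """Group request keys into batches of at most max_size."""

        def __init__(self, max_size):
            self.max_size = max_size

        def batch(self, keys):
            keys = list(keys)
            return [keys[i:i + self.max_size]
                    for i in range(0, len(keys), self.max_size)]


    class TestRequestBatcher(tests.TestCase):

        def test_splits_into_bounded_batches(self):
            # No repository, no history: just the policy under test.
            batcher = RequestBatcher(max_size=2)
            self.assertEqual([[1, 2], [3, 4], [5]],
                             batcher.batch([1, 2, 3, 4, 5]))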

This leads me to think that we want to test moderately large stacks
hooked together, and stub them out only at the boundary where things
become expensive, and preferably at a narrow interface.  For example I
think MemoryTree or MemoryTransport are good in this way.
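
For instance (again from memory, so treat the details as approximate):

    from bzrlib import tests


    class TestSimpleCommit(tests.TestCaseWithMemoryTransport):

        def test_commit_empty_tree(self):
            # The branch and tree live on a MemoryTransport, so the stack
            # from MutableTree.commit downwards gets exercised without
            # any real disk IO.
            tree = self.make_branch_and_memory_tree('t')
            tree.lock_write()
            self.addCleanup(tree.unlock)
            tree.add([''])
            revid = tree.commit('empty commit')
            self.assertEqual(revid, tree.branch.last_revision())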

The fewer lines of test-specific code we have, while increasing speed
and the expressiveness of tests, the better.

>  - Do less IO. IO is slow. We go to great lengths with our lockdir and
>   disk data structures to be fast, but atomic - and doing atomicish
>   operations tends to make the filesystem batch up a chunk of
>   operations. In ext4, I hear that many fewer temporary files will
>   ever hit disk - but we'll still be serialising and deserialising all
>   that data. Even so, all that data being pushed out to kernel space
>   and back has an overhead, and we'll be paying it. On my laptop the
>   disk light stays locked on pretty much the entire test run - once the
>   tests get 5-10 seconds in.

Run on a tmpfs - last time I measured, it was over a third faster.
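
Roughly what I do is the following; it assumes selftest's scratch
directories honour TMPDIR, so adjust if yours end up somewhere else:

    mkdir -p /tmp/ramdisk
    sudo mount -t tmpfs -o size=512M tmpfs /tmp/ramdisk
    TMPDIR=/tmp/ramdisk ./bzr selftest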

-- 
Martin <http://launchpad.net/~mbp/>