Wrote a small black-box test suite for bazaar [attn bzr-gtk, qbzr developers]
robertc at robertcollins.net
Thu Aug 28 01:43:03 BST 2008
I think I got off on the wrong foot; I've answered the point-by-point
stuff, but there is a deeper discussion at the end of the mail - sorry
for any confusion I've caused.
On Wed, 2008-08-27 at 16:01 +0200, Geoff Bache wrote:
> > Do note that repository data is not the
> > same from run to run even with the same format.
> Yes. This doesn't matter because
> a) we only compare what we want to compare, not every file in the
> b) we have a way of filtering out parts of files that change from run
> to run.
Well this reduces the value of the test, to me.
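For context on the filtering Geoff mentions: tools like TextTest typically normalise run-dependent output with regex filters before comparing it against a stored baseline. A minimal sketch of that idea (the filter patterns here are hypothetical illustrations, not TextTest's actual configuration format):

```python
import re

# Hypothetical filters: rewrite run-to-run noise (timestamps,
# generated ids) to stable placeholders before comparison.
FILTERS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "<timestamp>"),
    (re.compile(r"revision-id: \S+"), "revision-id: <id>"),
]

def normalise(text):
    """Apply every filter so two runs can be compared textually."""
    for pattern, replacement in FILTERS:
        text = pattern.sub(replacement, text)
    return text

a = normalise("committed 2008-08-27 16:01:03 revision-id: rev-abc123")
b = normalise("committed 2008-08-28 01:43:03 revision-id: rev-def456")
assert a == b  # both collapse to the same filtered form
```

The trade-off Robert is pointing at: anything a filter rewrites is, by construction, something the test no longer checks.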
> > > Fork+exec is also slow when done thousands of times, which is why
> > > we only fork & exec sufficient to test that the test harness is
> > > robust and beyond that do it all
> > OK. I'm not claiming that this kind of test can replace your entire
> > test suite, or that that is even practical at this stage of
> > development. I doubt whether the extra time taken by fork+exec is
> > going to have that much of an impact as a percentage of total time,
> > but you'd need to write a lot of tests to find out. TextTest has
> > support for parallelising the tests on several machines if slowness
> > becomes an issue. With a test-suite, the important thing is whether
> > it can be run (a) "while I watch" (b) "while I get a coffee" or (c)
> > "only at night" so performance changes of the order of (say) 10%
> > don't make much difference in practice.
> > Heh. The way you migrate from a) to b) is in a series of 1 and 2
> > percent increments. That said, a fork+exec test that runs bzr (say)
> > 5 times, is upwards of 1000ms. The same test in process is under
> > 20ms. That's a 50 times difference. It's not slightly slower.
> I don't follow what you mean with "bzr (say)". You seem to be
> suggesting that you can give precise performance figures for a test
> that runs an arbitrary program performing an arbitrary operation. Are
> these actual benchmark numbers of some sort? Can you provide more
> details if so?
I can give precise minimum test costs, because we know what the minimum
cost for starting up bzr is. John gave precise examples. We run fifteen
thousand test cases today. Every 10 of those we convert adds another
second to the overall test suite run - and it's too long as it is.
> Running "while true ; do date ; done | uniq -c" in bash is a
> recognised way to test the performance of fork(), found by googling a
> bit. This produces around 600 forks per second on my (ancient Pentium
> 4) linux machine. If I print the date instead using "python -c 'import
> time; print time.asctime()'" I still get 50 forks a second. Either
> way, I don't come close to being as slow as 5 per second. With
> virtualisation I suppose it might be slower but 5 a second seems
So, windows fork() is _much_ slower, and python startup is the
bottleneck, not fork itself.
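The gap is straightforward to measure. A rough sketch (timings vary widely by machine and OS, so no figures are asserted here):

```python
import subprocess
import sys
import time

def time_runs(fn, n):
    """Total wall-clock seconds for n calls of fn."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

# Each spawn pays a full interpreter start - the minimum cost every
# fork+exec'd test incurs before it does any real work.
spawned = time_runs(
    lambda: subprocess.run([sys.executable, "-c", "pass"]), 10)

# The same no-op performed in-process.
in_process = time_runs(lambda: None, 10)

print("10 spawns:     %.3fs" % spawned)
print("10 in-process: %.6fs" % in_process)
```

On windows, where there is no fork at all, each spawn is a full process creation plus interpreter start, which is where the "much slower" comes from.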
> My point is that this equation all depends on how many tests you have
> (or need to have), how long they take anyway, and what your
> environment is like.
15K test instances today; growing as we add code, shrinking as
deprecated interfaces get removed.
> > I'm sorry, I didn't understand what the advantages over our current
> > test strategy for command line operations were from that mail.
> <pasted from original mail - did you not see this or not understand
> it?>
Didn't understand it.
> The main points:
> 1) You don't need to know the code to write tests (I've never looked
> at the code)
I don't see this as an advantage. Skipping all the development process
theory, it boils down to 'what failures will you catch that tests
written by people who know the code will not catch'.
> 2) Tests don't depend on the structure of the code and hence don't
> need changing when the code is refactored.
This implies tests that cannot leverage the structure of the code, and
thus must exercise all layers present, rather than the layer that needs
testing. Say you have 10 layers in your code base; if all take the same
fraction of time in an operation (unlikely, but it works for reasoning
about this), then you are doing 10 times as much work as needed to test
the system-under-test. In reality the outermost layers probably do the
least work, so this ratio goes up to 100 or 1000 times the work to test
a command line interface's actual logic.
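A toy illustration of the layering point, using hypothetical names (this is not bzrlib code): the outer layer exists only to be traversed, yet a black-box test pays for it on every single run.

```python
import io

# Hypothetical core layer: the logic actually worth testing.
def update_message(tree_revno, branch_revno):
    if tree_revno == branch_revno:
        return "Tree is up to date at revision %d." % branch_revno
    return "Updated to revision %d." % branch_revno

# Hypothetical outer layer: dispatch and I/O that a black-box test
# must re-execute every time just to reach the core.
def cmd_update(out):
    out.write(update_message(0, 0) + "\n")

# Testing the core directly is one cheap function call...
assert update_message(0, 0) == "Tree is up to date at revision 0."

# ...while the black-box route exercises every layer above it as well.
buf = io.StringIO()
cmd_update(buf)
assert buf.getvalue() == "Tree is up to date at revision 0.\n"
```

With ten real layers (argument parsing, locking, transport, repository access, ...) the overhead multiplies accordingly.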
> 3) There are already quite a few blackbox tests that look like
>     def test_update_standalone_trivial(self):
>         out, err = self.run_bzr('update')
>         self.assertEqual('Tree is up to date at revision 0.\n', err)
>         self.assertEqual('', out)
> This is basically a way to write tests of that form without writing
> any code.
Sure. We try to keep those tests to an absolute minimum though, and if
you look at more of them our best practice is to use bzrlib API calls
to validate the operations - the _trivial forms should be in the
absolute minority.
Having looked more closely at texttest, I think I was confused by what
you were proposing. texttest is a user acceptance test framework.
So, my feelings about this for bzr are:
- We have a domain language for writing tests for bzr - all the way
from core code to acceptance tests. Writing tests in a different
language only makes sense if we expect enough of those tests that
having their own DSL is a benefit for the authors and maintainers
rather than learning the existing domain language.
- the fork+exec model fits very poorly with the goal of running many
tests, and fits poorly on windows in general (and windows
portability is important to us).
 - IMO user acceptance tests are not equivalent to black box tests -
   they are testing that the users' goals really are satisfied; so they
   may sometimes be best represented by driving the UI, but may at
   other times be best represented by driving the API. (For starters,
   it depends on the 'user' - the folk writing qbzr and bzr-gtk need
   API level tests; folk driving the CLI probably want UI tests, except
   when the thing they are asking about really isn't a UI problem.)
We have room to improve in our documentation though - we commonly have
examples, and currently we don't test the docs. We write our
documentation in ReST; doctest for python can pick examples out and run
them, but it needs more glue than we have today to allow (for instance)
$ bzr init foo
$ touch bar
$ bzr add
to be represented as a test - *and run on windows*.
I really must emphasise this: once we have that sort of facility,
changes to our docs made by windows-based developers need to be
testable by those developers.
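A sketch of what that missing glue might look like, with a stub dispatcher standing in for bzr's real command machinery: pull the `$` lines out of a ReST literal block and dispatch them in-process, so no fork+exec is needed and the same doc test runs on windows.

```python
import re

# A ReST fragment of the kind our docs already contain.
EXAMPLE = """\
To start a branch and add a file::

    $ bzr init foo
    $ touch bar
    $ bzr add
"""

def extract_commands(rest_text):
    """Pull '$ ...' example lines out of ReST literal blocks."""
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\$ (.+)$", rest_text, re.M)]

def run_transcript(rest_text, dispatch):
    # Real glue would route argv to the command machinery in-process
    # rather than fork+exec'ing a fresh bzr for every line.
    for line in extract_commands(rest_text):
        dispatch(line.split())

seen = []
run_transcript(EXAMPLE, seen.append)
assert seen == [["bzr", "init", "foo"], ["touch", "bar"], ["bzr", "add"]]
```

Checking each command's output against the text that follows it in the doc is the remaining (harder) half of the glue.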
I'm neutral on using texttest for implementing documentation testing; I
suspect we've probably got more support from the python community for
fixing doctest, as there is an existing dev community that uses that
tool.
I'm against introducing more time into the main test suite unless it
actually increases the quality of our tests; and from my reading about
texttest it doesn't intrinsically do that - the primary thing it offers
(allowing 'non developers' to write tests) is a useful way to
communicate when you have problems with non-developers telling
developers what they need, but I think 90% of the agile/xp angst around
that particular problem is a failure to embed the customer deeply enough
into the development process. And that's all about business - we're in
open source :).
Two related projects to bzr that may well love texttest are the bzr-gtk
and qbzr projects, which are writing GUIs - and unlike bzr's core,
don't have a really slick way to write comprehensive tests for gui
interacting code (or didn't last time I looked - I may be out of date).
GPG key available at: <http://www.robertcollins.net/keys.txt>.