Wrote a small black-box test suite for bazaar [attn bzr-gtk, qbzr developers]

Geoff Bache geoff.bache at gmail.com
Thu Aug 28 13:14:39 BST 2008


> >
> >         Do note that repository data is not the
> >         same from run to run even with the same format.
> >
> >
> > Yes. This doesn't matter because
> > a) we only compare what we want to compare, not every file in the
> > repository.
> > b) we have a way of filtering out parts of files that change from run
> > to run.
>
> Well this reduces the value of the test, to me.
>

Why? You can compare exactly what you want at the level of detail that's
appropriate. How can flexibility reduce the value of anything?
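
To make that concrete: the normalisation involved is just pattern
substitution on the output before comparison. A minimal sketch (the patterns
here are made up for illustration; TextTest drives its real filtering from
its config files):

    import re

    # Run-dependent details (dates, revision ids) are rewritten to fixed
    # placeholders so that only the meaningful parts get compared.
    RUN_DEPENDENT = [
        (re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'), '<timestamp>'),
        (re.compile(r'revision-id: \S+'), 'revision-id: <revid>'),
    ]

    def normalise(text):
        for pattern, replacement in RUN_DEPENDENT:
            text = pattern.sub(replacement, text)
        return text

    def outputs_match(actual, expected):
        return normalise(actual) == normalise(expected)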


> I can give precise minimum test costs, because we know what the minimum
> cost for starting up bzr is. John gave precise examples. We run fifteen
> thousand test cases today. Every 10 of those we convert adds another
> second to the overall test suite run - and it's too long as it is.


OK, John demonstrated convincingly that the slowness is in "import bzrlib"
and not in either python or fork/exec. It's therefore likely that a large
suite of tests of this sort will be slower than what you have today
*when run on a single machine*. How much slower is open to question but it's
probably seconds rather than milliseconds.
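
For reference, the measurement is trivial to reproduce; a minimal timing
sketch:

    import time

    start = time.time()
    import bzrlib               # the import John identified as the bottleneck
    print "import bzrlib took %.3fs" % (time.time() - start)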

But I regularly run a test suite that takes 20 minutes on a single
machine in around 30 seconds, because TextTest has support for parallel
testing. So if it's too long as it is, would it not be interesting to be
able to run it in parallel? Even a handful of machines linked together could
end up reducing the time dramatically, more than making up for the extra
"import bzrlib" time. Obviously being in-process wouldn't be possible anyway
in this scenario.
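
TextTest farms tests out via a batch queue system (Sun Grid Engine and the
like), but even a minimal single-machine sketch shows how little machinery
is involved. Everything below is illustrative, not TextTest's actual code;
commands are argv lists such as ['bzr', 'update']:

    import subprocess, threading, Queue

    def _worker(queue, results, lock):
        # Pull commands off the shared queue until it is empty.
        while True:
            try:
                cmd = queue.get_nowait()
            except Queue.Empty:
                return
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE)
            out, err = proc.communicate()
            lock.acquire()
            results.append((cmd, proc.returncode, out, err))
            lock.release()

    def run_parallel(commands, num_workers=8):
        queue = Queue.Queue()
        for cmd in commands:
            queue.put(cmd)
        results, lock = [], threading.Lock()
        workers = [threading.Thread(target=_worker,
                                    args=(queue, results, lock))
                   for _ in range(num_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return results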

> > Running "while true ; do date ; done | uniq -c" in bash is a
> > recognised way to test the performance of fork(), found by googling a
> > bit. This produces around 600 forks per second on my (ancient Pentium
> > 4) linux machine. If I print the date instead using "python -c 'import
> > time; print time.asctime()'" I still get 50 forks a second. Either
> > way, I don't come close to being as slow as 5 per second. With
> > virtualisation I suppose it might be slower but 5 a second seems
> > extreme.
>
> So, windows fork() is _much_ slower, and python is the bottleneck, not
> fork()+exec() itself.
>

As John showed, "import bzrlib" is the bottleneck, not python. As for Windows
slowness, I personally handle this by testing on Linux normally and on
Windows only in a nightly job.
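
For what it's worth, the same measurement is easy to make from Python
directly. A rough sketch (POSIX only, and measuring bare fork rather than
fork+exec):

    import os, time

    n = 200
    start = time.time()
    for i in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)     # child does nothing: this measures fork alone
        os.waitpid(pid, 0)
    elapsed = time.time() - start
    print "%d forks in %.2fs (%.0f/sec)" % (n, elapsed, n / elapsed)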

> > The main points:
> > 1) You don't need to know the code to write tests (I've never looked
> > at the code)
>
> I don't see this as an advantage. Skipping all the development process
> theory, it boils down to 'what failures will you catch that tests
> written by people that know the code will not catch'.
>

I thought it was fairly well recognised that people who have implemented
something have a tendency to test it poorly. They have thought about the
problem in a certain way and can be blind to other ways of thinking about
it. Also, they have a tendency to test their own implementation rather than
the actual requirements.

Bottom line: tests written by developers are good at proving that the code
does what they intended it to do. They are less good at proving that it does
what it's supposed to do.

Also, people steeped in the code are generally a much smaller group than
those who have an interest in it behaving correctly.


> > 2) Tests don't depend on the structure of the code and hence don't
> > need changing when the code is refactored.
>
> This implies tests that cannot leverage the structure of the code, and
> thus must exercise all layers present, rather than the layer that needs
> testing. Say you have 10 layers in your code base; if all take the same
> fraction of time in an operation (unlikely, but it works for reasoning
> about this) then you are doing 10 times as much work as needed to test
> the system-under-test. In reality the outermost layers probably do the
> least work, so this ratio goes up to 100 or 1000 times the work to test
> a command line interface's actual logic.


But you need far fewer tests this way. Because you can have non-coders
involved, and because you necessarily focus on realistic and interesting
usage of the system, you don't end up writing thousands of tests for things
that will never happen in practice (or that nobody will care about if they
do).

Leveraging code structure is good if you want to provide well-isolated units
that can seamlessly be used in a different context. In practice, though,
most code will only ever be used for one purpose as part of one system.
Structure-dependent tests also have the big disadvantage that they "hold the
code hostage" after a while: nobody's going to redesign the code, however
necessary that becomes, if doing so means rewriting 300 tests.

In any case, I haven't claimed this would replace the whole test suite.


> > 3) There are already quite a few blackbox tests that look like
> >
> > def test_update_standalone_trivial(self):
> >     self.make_branch_and_tree('.')
> >     out, err = self.run_bzr('update')
> >     self.assertEqual('Tree is up to date at revision 0.\n', err)
> >     self.assertEqual('', out)
> >
> > This is basically a way to write tests of that form without writing
> > any code.
>
> Sure. We try to keep those tests to an absolute minimum though, and if
> you look at more of them our best practice is to use bzrlib API calls
> to validate the operations - the _trivial forms should be in the absolute
> minority.
>
> Having looked more closely at texttest, I think I was confused by what
> you were proposing. texttest is a user acceptance test framework.
> So, my feelings about this for bzr are:
>  - We have a domain language for writing tests for bzr - all the way
>   from core code to acceptance tests. Writing tests in a different
>   language only makes sense if we expect enough of those tests that
>   having their own DSL is a benefit for the authors and maintainers
>   rather than learning the existing domain language.


What do you mean by "domain language" here? I didn't really understand this
comment. Are you referring to what would happen with TextTest or to what is
the case now?
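
To make the earlier point concrete: the trivial test quoted above would be
pure data in TextTest rather than code - roughly like this (simplified; the
real file layout has a few more details), with a config file at the suite
root naming the bzr executable:

    update_trivial/
        options        command-line arguments:   update
        output.bzr     expected stdout (empty here)
        errors.bzr     expected stderr:   Tree is up to date at revision 0.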


>
>  - the fork+exec model fits very poorly with the goal of running many
>   tests, and fits poorly on windows in general (and windows
>   portability is important to us).


If in-process speed is seen as all-important, yes. I personally think that
should be well down the list of priorities for an automated test suite,
especially once it has grown beyond what can be run interactively anyway.
And I think parallelising is easier than you might think, and a better
generic solution to slow tests: hardware is cheap these days (though I admit
the equation is more complicated if all the developers are working at home).

>   - IMO user acceptance tests are not equivalent to black box tests -
>   they are testing that the user's goals really are satisfied; so they
>   may sometimes be best represented by driving the UI, but may at
>   other times be best represented by driving the API. (For starters,
>   it depends on the 'user' - the folk writing qbzr and bzr-gtk need
>   API level tests, folk driving the CLI probably want UI tests, except
>   when the thing they are asking about really isn't a UI problem.)


Yes, this is true. And IMO user acceptance tests are the most important and
most effective form of testing (though not usually the fastest...)


> We have room to improve in our documentation though - we commonly have
> examples, and currently we don't test the docs. We write our
> documentation in ReST; doctest for python can pick examples out and run
> them, but it needs more glue than we have today to allow (for instance)
>
> $ bzr init foo
> $ touch bar
> $ bzr add
>
> to be represented as a test - *and run on windows*.
>
> I really must emphasise this, as changes to our docs by windows-based
> developers need to allow the developer to test the changes, once we have
> that sort of facility.
>
> I'm neutral on using texttest for implementing documentation testing; I
> suspect we've probably got more support from the python community for
> fixing doctest, as there is an existing dev community that use that
> extensively.
>

I would agree. TextTest isn't designed for testing documentation; doctest
is. I don't think that would be a sensible use for the tool.
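
That said, the missing glue needn't be large. A rough sketch of a transcript
runner (hypothetical code, and Windows command parsing would need more care
- which is exactly the portability point being made):

    import shlex, subprocess

    def run_transcript(text):
        # Hypothetical glue: run each '$ command' line from a documentation
        # example and compare following non-command lines with its stdout.
        lines = text.strip().splitlines()
        i = 0
        while i < len(lines):
            command = lines[i].strip()
            i += 1
            if not command.startswith('$ '):
                continue
            expected = []
            while i < len(lines) and not lines[i].strip().startswith('$ '):
                if lines[i].strip():
                    expected.append(lines[i].strip())
                i += 1
            # shlex.split follows POSIX quoting rules; Windows differs,
            # which is part of the glue that would still be needed.
            proc = subprocess.Popen(shlex.split(command[2:]),
                                    stdout=subprocess.PIPE)
            out, _ = proc.communicate()
            actual = [l.strip() for l in out.splitlines() if l.strip()]
            assert actual == expected, (command, expected, actual)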


> I'm against introducing more time into the main test suite unless it
> actually increases the quality of our tests; and from my reading about
> texttest it doesn't intrinsically do that - the primary thing it offers
> (allowing 'non developers' to write tests) is a useful way to
> communicate when you have problems with non developers telling
> developers what they need, but I think 90% of the agile/xp angst around
> that particular problem is a failure to embed the customer deeply enough
> into the development process. And that's all about business - we're in
> open source :).
>

Interesting. You do have a point there: the open source equation is a bit
different, as the developers are usually also users, which eliminates many
of the problems with requirements and differing perspectives. I don't think
the issue goes away entirely, but it is certainly reduced.

It's clear that there is also an inbuilt reluctance to change operating here
:) I respect that: you've already invested a lot in the current setup. Maybe
things would be different if the project were at an earlier stage.


> Two related projects to bzr that may well love texttest are the bzr-gtk
> and qbzr projects, which are writing GUIs - and unlike bzr's core,
> don't have a really slick way to write comprehensive tests for
> GUI-interacting code (or didn't last time I looked - I may be out of date).
>

OK. If I feel inspired I might try to write a little test suite for bzr-gtk:
it's always easier to convince people who don't already have lots of tests
:)

Regards,
Geoff