[Launchpad-dev] Bzr plugin for guessing relevant test modules: Fault line
vila
v.ladeuil+lp at free.fr
Fri Jun 3 07:29:44 UTC 2011
>>>>> Aaron Bentley <aaron at aaronbentley.com> writes:
> On 11-06-02 12:42 PM, vila wrote:
>>>>>>> Aaron Bentley <aaron at aaronbentley.com> writes:
>> Yup. I've been thinking about the coverage-based approach for quite some
>> time: run each test in isolation, establish which lines are covered,
>> then, from a diff, for each line modified, add the corresponding tests.
> The issue I thought of with coverage-based testing is that you might get
> spurious positives. For example, most of our tests will cover
> Branch.last_revision, so if you change that, which ones should you run?
All. Doesn't that suck? :-/
There is no way to ensure that a change doesn't need to be tested. If I
modify the test framework, for example, all tests will probably be seen
as impacted.
But that doesn't matter that much after all. If I can run only 10% of
the whole test suite in 90% of the cases, I'm already damn happy.
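To make the selection step concrete, here is a minimal sketch (the
names line_coverage, record and tests_for_diff are hypothetical; it
assumes the expensive upfront phase already recorded which lines each
test executes, e.g. by running every test in isolation under Python's
'coverage' module):

    from collections import defaultdict

    # Hypothetical layout: (file, line) -> ids of the tests whose
    # isolated run executed that line.
    line_coverage = defaultdict(set)

    def record(test_id, covered_lines):
        # covered_lines: the (file, line) pairs one test run touched.
        for file_line in covered_lines:
            line_coverage[file_line].add(test_id)

    def tests_for_diff(modified_lines):
        # From a diff's modified (file, line) pairs, collect every
        # test whose recorded coverage touches any of them.
        selected = set()
        for file_line in modified_lines:
            selected |= line_coverage.get(file_line, set())
        return selected

With such a map, a change to a hot spot like Branch.last_revision
selects nearly every test id, which is exactly the spurious-positive
case above.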
>> There is quite a high price to pay upfront but then the increments
>> should be reasonably fast to calculate and reasonably accurate. But I
>> won't bet this could replace a full test suite run.
> No, but Launchpad's test suite takes several hours to run. Because of
> this, shortcuts that help us find the likely test failures quickly are a
> bigger win for Launchpad than for bzr.
Right.
IMHO, what matters is to run the most relevant tests first. The most
interesting effect of your plugin is to give a hint about where to
*start* when modifying a piece of code you don't know or don't know
*enough* to know which tests are relevant. I.e., you don't have to
introduce an obvious bug and run the whole test suite just to get a
vague idea of which tests cover your work.
Once you get more familiar with the code, you then know where to focus
(which subset of the suite is more likely to contain a failure) and from
there, reduce the iteration duration.
It's all about learning, really. A failing test will always teach me
more about the code I'm modifying than a succeeding one. Ideally a
single test needs to be run: a failing one. We are far from that
because a bug in the implementation will also make higher-level tests
fail (that's another issue, but still related here: when a bunch of
tests are failing, how do I find the most relevant one?).
So while both approaches help me find some of the tests that may fail
when I introduce a bug or a new feature, they don't tell me which test
is the most relevant.
One of the basic assumptions here is that tests that don't exercise a
piece of code can't be broken by changes to it, so they aren't relevant
and don't need to be run: they won't teach me anything if they fail
(and they shouldn't fail if the code I modify doesn't impact them,
right? ;).
Coming back to the spurious positives: if I modify an implementation
without modifying its contract, I should be able to run only the
implementation's tests and be confident that my job is done. To check
that, I should be able to run another, larger set of tests exercising
the contract. If a failure is encountered there, I'll learn that I
missed some point. So one difference between your plugin and a
coverage-based approach is that the former should generally identify
the implementation tests while the latter should also find the
contract tests.
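As a toy illustration (hypothetical classes, not bzrlib code), a
contract test is written once against the interface and reused for
every implementation, while an implementation test is free to poke at
one implementation's internals:

    import unittest

    class StackContractMixin:
        # Contract: checks every stack implementation must pass.
        def test_push_then_pop_returns_item(self):
            stack = self.make_stack()
            stack.push(42)
            self.assertEqual(42, stack.pop())

    class ListStack:
        # One concrete implementation, backed by a plain list.
        def __init__(self):
            self._items = []
        def push(self, item):
            self._items.append(item)
        def pop(self):
            return self._items.pop()

    class TestListStackContract(StackContractMixin, unittest.TestCase):
        def make_stack(self):
            return ListStack()

    class TestListStackInternals(unittest.TestCase):
        # Implementation test: tied to ListStack's backing list.
        def test_push_appends_to_backing_list(self):
            stack = ListStack()
            stack.push(1)
            self.assertEqual([1], stack._items)

In this toy setup, coverage-based selection for a change to
ListStack.pop pulls in both test classes, while a guess keyed to the
implementation module would more likely surface only
TestListStackInternals.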
I don't think the ordering problem can be easily addressed. There are
various techniques that make it easier to solve, though: avoiding
eager tests and enhancing defect localization. Both provide "relevant"
tests that should be run before the others.
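One possible heuristic in that direction (hypothetical, building on
the earlier sketch; lines_per_test is an assumed map from test id to
the number of (file, line) pairs that test executes): among the
selected tests, run the most narrowly focused ones first, using a
test's coverage footprint as a cheap proxy for how well a failure in
it localizes the defect.

    def order_by_localization(selected_tests, lines_per_test):
        # Sort so the most narrowly-scoped tests run first: a smaller
        # coverage footprint suggests better defect localization.
        return sorted(selected_tests,
                      key=lambda test_id: lines_per_test[test_id])

An eager test that drives half the system sorts last; a unit test that
touches little more than the modified function sorts first.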
Our collective knowledge about the code is captured by the test suite;
it's incomplete and imperfect, and it can't (and probably never will)
tell us the most relevant order in which to run the suite.
Finding relevant *sub*sets already makes us more productive, and
that's great. Ordering such a subset will make the iteration faster
only if we have failing tests.
So my humble approach overall is to write better failing tests. This
starts by making sure that each test has failed at least once. The
next step is to make sure all the code is covered with such tests, and
to repeat for the higher levels.
In the meantime, I rely on my knowledge of the test suite to run more
and more tests in rough order of relevance.
So all shortcuts that help me reduce the relevant subset are already a
big win, no matter how imperfect they are.
Vincent