judge (was Re: this week in ~canonical-bazaar)
Martin Pool
mbp at canonical.com
Wed Oct 26 04:33:31 UTC 2011
On 26 October 2011 15:05, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > Part of my idea here is that if we can get a bit of mechanized wisdom
> > about how to validly compare two series of timing data, that would be
> > more reliable than people either just eyeballing it, or making their
> > own guess about what technique to use.
>
> As Matthew points out, the eye is pretty damn good at recognizing
> patterns in graphically displayed data. Statistics are for cases
> where the patterns are a distraction, or where you need a mechanical
> decision procedure, or you're submitting to an academic journal in
> neuroscience. And of course they may be needed to construct the data
> required for interocular trauma by graphics.
By 'eyeballing' I meant looking only at the numbers.
Graphs can be very useful, but for communication in, say, a mail
thread or a merge proposal it is handy to be able to just say "25%
faster." Obviously there are big caveats around the test case,
hardware, and environment, but it gives some idea.
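Roughly the kind of summary I mean could be produced like this - just
a sketch in Python with numpy, with made-up sample data and function
names, not what the comparison script currently does:

import numpy as np

def percent_faster(old_times, new_times, n_boot=10000, seed=0):
    """Summarize two timing samples as "new is X% faster than old",
    with a bootstrap confidence interval on that percentage."""
    rng = np.random.default_rng(seed)
    old = np.asarray(old_times, dtype=float)
    new = np.asarray(new_times, dtype=float)
    point = 100.0 * (old.mean() - new.mean()) / old.mean()
    boots = []
    for _ in range(n_boot):
        o = rng.choice(old, size=old.size, replace=True)
        n = rng.choice(new, size=new.size, replace=True)
        boots.append(100.0 * (o.mean() - n.mean()) / o.mean())
    low, high = np.percentile(boots, [2.5, 97.5])
    return point, low, high

# hypothetical usage:
# pct, low, high = percent_faster(old_runs, new_runs)
# print("%.0f%% faster (95%% CI %.0f%%..%.0f%%)" % (pct, low, high))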
> > Right, and this suggests that perhaps we should invert it and rather
> > than giving the probability that the difference in means would not
> > arise by chance, instead give the percentage difference in mean value
> > at a confidence level: 99% sure that the new program is > 5% faster.
>
> Sure, you want the results to be fairly stable, and the way to do that
> is to set a criterion for "noticeably better", then do enough
> repetitions that the test is powerful at high significance levels when
> the true difference is at least as high as the criterion. Which is
> basically what John said, except that he didn't give an explicit
> statement about "noticeably better".
> You've added that, but you should
> note that "noticeably better" is going to vary with the implemented
> function. 5% of a function that takes 1% of a command's run time is
> not noticeably better. (That's a strawman example since you're talking
> about testing commands, but the same idea applies to commands. E.g., I
> doubt anyone will send flowers if "bzr whoami" gets 5% faster.)
bzr whoami being faster would mostly indicate that startup time has
improved, which is actually very worthwhile, and as it happens it was
improved substantially in 2.4:
   mean      sd     min     max  cmd
217.290   8.983 202.589 246.617  bzr2.3 whoami
126.535   5.200 120.863 137.503  bzr2.5 whoami
commands are VERY PROBABLY Statistically different
p=0.000

   mean      sd     min     max  cmd
123.256   4.104 114.939 129.637  bzr2.4 whoami
126.772   6.871 116.399 137.195  bzr2.5 whoami
commands are probably NOT statistically different
p=0.215
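For reference, output along those lines can be produced by running a
two-sample test over the raw timings; here is a rough sketch using
Welch's t-test from scipy, which is not necessarily the exact test my
script applies:

import numpy as np
from scipy import stats

def compare_timings(times_a, times_b, label_a, label_b, alpha=0.01):
    """Print summary statistics for two timing samples and the p-value
    from Welch's t-test (no equal-variance assumption) on their means."""
    print("%8s %7s %8s %8s  %s" % ("mean", "sd", "min", "max", "cmd"))
    for label, t in ((label_a, times_a), (label_b, times_b)):
        t = np.asarray(t, dtype=float)
        print("%8.3f %7.3f %8.3f %8.3f  %s"
              % (t.mean(), t.std(ddof=1), t.min(), t.max(), label))
    _, p = stats.ttest_ind(times_a, times_b, equal_var=False)
    if p < alpha:
        print("commands are VERY PROBABLY statistically different")
    else:
        print("commands are probably NOT statistically different")
    print("p=%.3f" % p)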
> 5% is probably a good ballpark criterion for the kinds of things
> you're likely to make the effort to test.
Right, doubly so because an improvement of less than 5% is generally
not very worthwhile, and also because run-to-run variation is
commonly around 5%, so anything smaller will be hard to confidently
detect?
Do you know offhand how to calculate this? Otherwise I think I can
work it out.
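My first guess at working it out would be a power calculation: take
the smallest interesting difference (~5%) divided by the run-to-run
standard deviation (~5%) as the effect size, then solve for the number
of repetitions that detects it at the chosen significance and power.
A sketch using statsmodels, with the particular numbers just
illustrative assumptions:

from statsmodels.stats.power import TTestIndPower

# Smallest interesting speedup, expressed in units of run-to-run
# standard deviation (Cohen's d).  With a ~5% expected improvement and
# ~5% run-to-run variation, the effect size is about 1.0.
effect_size = 0.05 / 0.05

runs = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.01,    # 1% chance of claiming a difference that isn't there
    power=0.99,    # 99% chance of detecting a real difference this big
    alternative='two-sided')
print("repetitions needed per command: %.0f" % runs)
# on the order of 50 per command for these inputs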
m