judge (was Re: this week in ~canonical-bazaar)

Martin Pool mbp at canonical.com
Wed Oct 26 04:33:31 UTC 2011


On 26 October 2011 15:05, Stephen J. Turnbull <stephen at xemacs.org> wrote:

>  > Part of my idea here is that if we can get a bit of mechanized wisdom
>  > about how to validly compare two series of timing data, that would be
>  > more reliable than people either just eyeballing it, or making their
>  > own guess about what technique to use.
>
> As Matthew points out, the eye is pretty damn good at recognizing
> patterns in graphically displayed data.  Statistics are for cases
> where the patterns are a distraction, or where you need a mechanical
> decision procedure, or you're submitting to an academic journal in
> neuroscience.  And of course they may be needed to construct the data
> required for interocular trauma by graphics.

By 'eyeballing' I meant looking only at the numbers.

Graphs can be very useful, but for communication in, for example, a
mail thread or a merge proposal it is handy to be able to just say
"25% faster."  Obviously there are big caveats around the test case,
hardware, and environment, but it gives some idea.
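
To keep a headline number like that honest, one could attach a
confidence interval to it.  Here is a rough sketch in plain Python,
not anything that exists in bzr today; the function name and the
99%/10000-resample defaults are just illustrative:

    import random

    def percent_speedup_ci(old, new, resamples=10000, confidence=0.99):
        """Bootstrap a confidence interval for the % speedup of new over old.

        old, new: lists of wall-clock times (same units) for each run.
        Returns (point_estimate, low, high) as percentages.
        """
        def speedup(a, b):
            mean_a = sum(a) / len(a)
            mean_b = sum(b) / len(b)
            return 100.0 * (mean_a - mean_b) / mean_a

        point = speedup(old, new)
        samples = []
        for _ in range(resamples):
            # Resample each series with replacement and recompute the speedup.
            a = [random.choice(old) for _ in old]
            b = [random.choice(new) for _ in new]
            samples.append(speedup(a, b))
        samples.sort()
        alpha = 1.0 - confidence
        low = samples[int(alpha / 2 * resamples)]
        high = samples[int((1 - alpha / 2) * resamples) - 1]
        return point, low, high

One could then quote the point estimate together with the lower end
of the interval, rather than a bare "25% faster".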

>  > Right, and this suggests that perhaps we should invert it and rather
>  > than giving the probability that the difference in means would not
>  > arise by chance, instead give the percentage difference in mean value
>  > at a confidence level: 99% sure that the new program is > 5% faster.
>
> Sure, you want the results to be fairly stable, and the way to do that
> is to set a criterion for "noticably better", then do enough
> repetitions that the test is powerful at high significance levels when
> the true difference is at least as high as the criterion.  Which is
> basically what John said, except that he didn't give an explicit
> statement about "noticeably better".

> You've added that, but you should
> note that "noticeably better" is going to vary with the implemented
> function.  5% of a function that takes 1% of a command's run time is
> not noticeably better.  (That's a strawman example since you're talking
> about testing commands, but the same idea applies to commands.  Eg, I
> doubt anyone will send flowers if "bzr whoami" gets 5% faster.)

bzr whoami being faster would mostly indicate that startup time has
improved, which is actually very worthwhile and, as it happens, was
improved substantially in 2.4:

     mean        sd       min       max cmd
  217.290     8.983   202.589   246.617 bzr2.3 whoami
  126.535     5.200   120.863   137.503 bzr2.5 whoami
commands are VERY PROBABLY Statistically different
p=0.000

     mean        sd       min       max cmd
  123.256     4.104   114.939   129.637 bzr2.4 whoami
  126.772     6.871   116.399   137.195 bzr2.5 whoami
commands are probably NOT statistically different
p=0.215
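
For what it's worth, output along those lines could be produced with
something like the following; this is only a sketch using scipy's
Welch t-test (which does not assume equal variances), not necessarily
what the script that produced the numbers above actually does:

    import statistics
    from scipy import stats

    def compare(times_a, times_b, label_a, label_b):
        # Print mean/sd/min/max for each series, roughly in the layout above.
        print("%9s %9s %9s %9s %s" % ("mean", "sd", "min", "max", "cmd"))
        fmt = "%9.3f %9.3f %9.3f %9.3f %s"
        for times, label in ((times_a, label_a), (times_b, label_b)):
            print(fmt % (statistics.mean(times), statistics.stdev(times),
                         min(times), max(times), label))
        # Welch's t-test on the two series of timings.
        t, p = stats.ttest_ind(times_a, times_b, equal_var=False)
        print("p=%.3f" % p)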

> 5% is probably a good ballpark criterion for the kinds of things
> you're likely to make the effort to test.

Right, doubly so: an improvement of less than 5% is generally not
very worthwhile, and run-to-run variation is commonly around 5%, so
anything smaller will be hard to detect with confidence?

Do you know offhand how to calculate this?  Otherwise I think I can
work it out.
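
My rough understanding, to be checked: the usual normal-approximation
rule of thumb is that the number of repetitions needed per series is
about 2 * ((z_alpha + z_beta) * sd / d)**2, where sd is the
run-to-run standard deviation and d is the smallest difference in
means worth detecting.  A sketch, assuming scipy (the function name
and defaults are just illustrative):

    from scipy.stats import norm

    def runs_needed(sd, min_difference, alpha=0.01, power=0.9):
        # Approximate repetitions per series to detect `min_difference`
        # between two means, using the two-sample normal approximation.
        # sd and min_difference must be in the same units.
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
        z_beta = norm.ppf(power)            # desired power
        n = 2 * ((z_alpha + z_beta) * sd / min_difference) ** 2
        return int(n) + 1

Plugging in the bzr2.3 whoami figures above (sd of about 9, and 5% of
the mean, about 11 in the same units, as the smallest interesting
difference) gives roughly 20 runs per series.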

m


