judge (was Re: this week in ~canonical-bazaar)

Stephen J. Turnbull stephen at xemacs.org
Wed Oct 26 04:05:42 UTC 2011


Martin Pool writes:

 > Yes, I was going to look for one once I thought I had it basically
 > working.  Are you a stats pro?

Well, I'm paid to teach it. ;-) Let's talk off-line.  For the record,
I would not be at all insulted if you go with somebody else, so
anybody should feel free to step up or to recommend a candidate.

 > Part of my idea here is that if we can get a bit of mechanized wisdom
 > about how to validly compare two series of timing data, that would be
 > more reliable than people either just eyeballing it, or making their
 > own guess about what technique to use.

As Matthew points out, the eye is pretty damn good at recognizing
patterns in graphically displayed data.  Statistics are for cases
where the patterns are a distraction, where you need a mechanical
decision procedure, or where you're submitting to an academic journal
in neuroscience.  And of course they may be needed to construct the
data required for the "interocular trauma" test (the result that hits
you right between the eyes) via graphics.
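To make "mechanical decision procedure" concrete, here is a minimal
sketch of the sort of thing I have in mind, assuming Python with
scipy and using made-up timings:

from scipy import stats

# Two hypothetical series of wall-clock timings, in seconds.
old = [1.92, 1.88, 1.95, 1.90, 1.93, 1.89, 1.91, 1.94]
new = [1.78, 1.82, 1.75, 1.80, 1.79, 1.83, 1.77, 1.81]

# Welch's t-test: compares the means without assuming the two
# series have equal variance, which timing runs rarely do.
t, p = stats.ttest_ind(old, new, equal_var=False)
print("t = %.2f, p = %.4g" % (t, p))

If the data have the occasional wild outlier, as timings often do, a
rank test such as stats.mannwhitneyu(old, new) is a more robust
choice.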

 > Right, I'm not surprised to see it vary.  However, I think John is
 > broadly on the right track in wanting repeated sets of trials of the
 > same two commands to give fairly stable results; if they don't then
 > judge is not useful in describing whether there is really a difference
 > or not.

 > Right, and this suggests that perhaps we should invert it and rather
 > than giving the probability that the difference in means would not
 > arise by chance, instead give the percentage difference in mean value
 > at a confidence level: 99% sure that the new program is > 5% faster.

Sure, you want the results to be fairly stable, and the way to do that
is to set a criterion for "noticeably better", then do enough
repetitions that the test is powerful at stringent significance levels
when the true difference is at least as large as the criterion.  That
is basically what John said, except that he didn't make an explicit
statement about "noticeably better".  You've added that, but you
should note that "noticeably better" is going to vary with the
function being measured.  A 5% speedup in a function that takes 1% of
a command's run time is not noticeably better.  (That's a strawman
example, since you're talking about testing whole commands, but the
same idea applies to commands.  E.g., I doubt anyone will send flowers
if "bzr whoami" gets 5% faster.)
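Here's a back-of-the-envelope sketch of that power calculation (the
helper and the numbers are mine, scipy assumed, and it uses the usual
normal approximation):

import math
from scipy.stats import norm

def repetitions_needed(sigma, criterion, alpha=0.01, power=0.95):
    # Normal approximation for a one-sided two-sample comparison:
    # runs per command so that a true difference of `criterion`
    # seconds is detected with the given power at level `alpha`.
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) * sigma / criterion) ** 2)

# E.g. a ~2 s command with 0.05 s of timing noise, and a 5%
# (i.e. 0.1 s) criterion for "noticeably better":
print(repetitions_needed(sigma=0.05, criterion=0.1))  # -> 8

Note that the required number of runs grows with the square of the
noise-to-criterion ratio, which is why noisy benchmarks need so many
repetitions.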

5% is probably a good ballpark criterion for the kinds of things
you're likely to make the effort to test.
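
Your inverted presentation could be prototyped as a percentile
bootstrap, something like this sketch (invented data again; numpy
assumed):

import numpy as np

def speedup_lower_bound(old, new, level=0.99, n_boot=10000, seed=0):
    # Percentile bootstrap: one-sided lower confidence bound, at
    # `level`, on the percentage speedup of `new` relative to `old`.
    rng = np.random.default_rng(seed)
    old, new = np.asarray(old), np.asarray(new)
    boots = []
    for _ in range(n_boot):
        o = rng.choice(old, size=old.size, replace=True)
        n = rng.choice(new, size=new.size, replace=True)
        boots.append(100.0 * (o.mean() - n.mean()) / o.mean())
    return np.percentile(boots, 100.0 * (1 - level))

# The same invented timings as above:
old = [1.92, 1.88, 1.95, 1.90, 1.93, 1.89, 1.91, 1.94]
new = [1.78, 1.82, 1.75, 1.80, 1.79, 1.83, 1.77, 1.81]
print("99%% sure the new command is at least %.1f%% faster"
      % speedup_lower_bound(old, new))

That gives people a number they can act on, rather than a bare
p-value.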
