this week in ~canonical-bazaar

John Arbash Meinel john at arbash-meinel.com
Tue Oct 25 10:46:47 UTC 2011




> poolie: * went to the codecon camp; was pretty interesting; have an
> idea for a tiny performance testing program in
> <https://launchpad.net/judge>

One interesting thought, if you want to work on the stats: if you get
a run that isn't statistically significant after 10 runs, you could
estimate the 'power' of your test and determine what N it would take
to reach statistical significance.
 http://en.wikipedia.org/wiki/Sample_size

Basically, the question is: given the observed averages and the
observed standard deviation, how many samples would you need to reach
statistical significance?

The wiki page doesn't give quite the right inversion of your
ttest_ind, but maybe the stats package has it?
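As a rough sketch of what that inversion could look like (pure
Python; it uses a normal approximation rather than the exact
noncentral t distribution, and `norm_cdf`, `approx_power`, and
`required_n` are names I made up, not anything from a stats package):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sample test to detect a true mean
    difference `delta` with per-group size n and common sd `sigma`."""
    z = 1.959963984540054  # z_{0.975} for a two-sided alpha=0.05 test
    se = sigma * math.sqrt(2.0 / n)  # standard error of the difference
    return norm_cdf(delta / se - z)

def required_n(delta, sigma, target=0.8, alpha=0.05):
    """Smallest per-group n whose approximate power reaches `target`."""
    n = 2
    while approx_power(delta, sigma, n, alpha) < target:
        n += 1
    return n

# 2s difference, 2s standard deviation, 80% power:
print(required_n(2.0, 2.0))  # 16
```

For a 2s difference with a 2s standard deviation this lands on 16
samples per program, which matches the confidence-interval arithmetic
below.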

As a very quick example, you would like the 95% confidence interval of
the mean not to include the other mean, which uses:

  n = 16σ^2/W^2

where W is the desired total width of the confidence interval.

So if program A takes 10s and program B takes 12s, with a standard
deviation of 2s (for these purposes, assume the variance is the same
for both programs), then you want W <= 2 and you have σ = 2, meaning
you need 16*4/4 = 16 samples.
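That arithmetic is simple enough to script; a minimal sketch (the
function name `ci_sample_size` is mine):

```python
import math

def ci_sample_size(sigma, width):
    """Samples needed so the 95% confidence interval of the mean has
    total width `width`, using the n = 16*sigma^2/W^2 rule of thumb."""
    return math.ceil(16.0 * sigma ** 2 / width ** 2)

# A at 10s, B at 12s: we want the CI no wider than the 2s gap.
print(ci_sample_size(sigma=2.0, width=2.0))  # 16
```

Note how quickly this grows: doubling σ to 4s quadruples the required
N to 64.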

When I tried it, with artificial numbers from
random.normalvariate(10+i*2, 4), I found some interesting results.
With the *same* underlying random functions:

     mean        sd       min       max cmd
    9.768     3.516     1.562    14.854 a
   10.639     4.923     1.952    21.806 b
commands are probably NOT statistically different
p=0.967
...
     mean        sd       min       max cmd
    9.032     4.236     1.371    14.705 a
   12.688     2.855     7.671    18.480 b
commands are VERY PROBABLY Statistically different
p=0.008
...
     mean        sd       min       max cmd
    9.483     2.866     3.726    16.010 a
   12.128     3.663     5.595    18.902 b
commands are PROBABLY statistically different
p=0.015
...
     mean        sd       min       max cmd
   10.732     3.883     2.573    16.972 a
   11.370     3.306     5.609    18.263 b
commands are probably NOT statistically different
p=0.615
...
     mean        sd       min       max cmd
    9.927     4.421     0.660    19.199 a
   13.997     2.812     7.213    18.854 b
commands are PROBABLY statistically different
p=0.018


The variation in p is pretty surprising to me: from 0.018, which is
almost VERY SIGNIFICANT, up to 0.967, very-very-not-significant. This
is using nrounds = 16.

Even with 50 rounds, I could get p=0.88, though most of the time it
was between 0.01 and 0.10.

Still, it shows why large N is important when σ is a large part of the
average.
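To reproduce that run-to-run spread, a minimal stdlib-only sketch (no
scipy, so it reports Welch's t statistic rather than a p-value;
`welch_t` and `simulate` are my names):

```python
import math
import random
import statistics

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples."""
    vx = statistics.variance(xs)
    vy = statistics.variance(ys)
    diff = statistics.mean(ys) - statistics.mean(xs)
    return diff / math.sqrt(vx / len(xs) + vy / len(ys))

def simulate(seed, n=16):
    """Draw n fake timings each for 'a' (mean 10s) and 'b' (mean 12s),
    both with sd 4s, mimicking random.normalvariate(10 + i*2, 4)."""
    rng = random.Random(seed)
    a = [rng.normalvariate(10, 4) for _ in range(n)]
    b = [rng.normalvariate(12, 4) for _ in range(n)]
    return welch_t(a, b)

for seed in range(5):
    print("seed=%d  t=%.3f" % (seed, simulate(seed)))
```

The t statistic bounces around considerably from seed to seed, which
is the same effect as the p-value swings in the runs above.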

Anyway, just a thought: using power to indicate the confidence in your
confidence can be useful.

John
=:->



