this week in ~canonical-bazaar
John Arbash Meinel
john at arbash-meinel.com
Tue Oct 25 10:46:47 UTC 2011
> poolie: * went to the codecon camp; was pretty interesting; have an
> idea for a tiny performance testing program in
> <https://launchpad.net/judge>
One interesting thought if you want to work on the stats: if you get a
run that isn't statistically significant after 10 rounds, you could
estimate the 'power' of your test and work out what N it would take to
detect a statistically significant difference.
http://en.wikipedia.org/wiki/Sample_size
Basically: given the observed averages and the observed standard
deviation, how many samples would you need to reach statistical
significance?
The wiki page doesn't give quite the right inversion of your
ttest_ind, but maybe the stats package has it? (See the sketch after
the example below.)
As a very quick example, you would like the 95% confidence interval
around one mean to exclude the other mean, which uses:
n = 16σ^2/W^2
(where W is the full width of the confidence interval).
So if program A takes 10s and program B takes 12s, and you have a
standard deviation of 2s (for these purposes, assume the variation is
the same for both programs), then you want W <= 2 and you have σ = 2,
meaning you need 16*4/4 = 16 samples.
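If you don't want to invert that by hand, this looks like the sort of
thing statsmodels can do; a minimal sketch, assuming statsmodels is
installed (solve_power solves for whichever argument you leave out;
the 80% power target is my assumption, it's just the usual default):

from statsmodels.stats.power import TTestIndPower

# Cohen's d: difference in means over the (common) standard deviation.
d = (12.0 - 10.0) / 2.0

# Samples per command, two-sided test at alpha=0.05, 80% power.
n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8)
print("samples per command: %.1f" % n)  # ~16.7, in line with the 16 above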
When I tried it with artificial numbers from
random.normalvariate(10 + i*2, 4) (command a gets mean 10, command b
gets mean 12, both with σ = 4), I found some interesting results.
With the *same* underlying random functions:
mean sd min max cmd
9.768 3.516 1.562 14.854 a
10.639 4.923 1.952 21.806 b
commands are probably NOT statistically different
p=0.967
...
mean sd min max cmd
9.032 4.236 1.371 14.705 a
12.688 2.855 7.671 18.480 b
commands are VERY PROBABLY Statistically different
p=0.008
...
mean sd min max cmd
9.483 2.866 3.726 16.010 a
12.128 3.663 5.595 18.902 b
commands are PROBABLY statistically different
p=0.015
...
mean sd min max cmd
10.732 3.883 2.573 16.972 a
11.370 3.306 5.609 18.263 b
commands are probably NOT statistically different
p=0.615
...
mean sd min max cmd
9.927 4.421 0.660 19.199 a
13.997 2.812 7.213 18.854 b
commands are PROBABLY statistically different
p=0.018
The variation in p is pretty surprising to me: from 0.018, which is
almost VERY SIGNIFICANT, all the way up to 0.967,
very-very-not-significant. This is using nrounds = 16.
Even with 50 rounds, I could get p=0.88, though most of the time it
was between 0.01 and 0.10.
Still, it shows why a large N is important when σ is a large fraction
of the difference you are trying to detect.
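For reference, a sketch of that experiment (the tables above are
judge's own output; this just shows how much p bounces around between
identical re-runs, assuming scipy is available):

import random
from scipy.stats import ttest_ind

nrounds = 16
for trial in range(5):
    # command a ~ N(10, 4), command b ~ N(12, 4),
    # i.e. random.normalvariate(10 + i*2, 4) for i in (0, 1)
    a = [random.normalvariate(10, 4) for _ in range(nrounds)]
    b = [random.normalvariate(12, 4) for _ in range(nrounds)]
    t, p = ttest_ind(a, b)
    print("trial %d: p = %.3f" % (trial, p))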
Anyway, just a thought: using power to indicate the confidence in your
confidence can be useful.
John
=:->