B+Tree indices: ongoing progress

John Arbash Meinel john at arbash-meinel.com
Wed Jul 2 03:59:07 BST 2008



Robert Collins wrote:
| So, with the sketch showing promise, I think this is worth doing a
| spike/mini sprint on - if we can drop this in with the stackable format,
| it could take us a step towards addressing some of our intrinsic
| performance issues. Or not. Either way, I'm going to keep at it for at
| least a bit more.
|
| I am writing a script to test the performance of the new index by
| comparing it to the indices from an arbitrary repository. I'm
| refactoring my ad hoc stuff to be more repeatable.
|
| The first thing is that jam's new tighter compression _really_ affects
| index creation time. The following is using my copy of bzr.dev's
| indices:
|
| Baseline overhead: Read 92994 keys in 7.583
| GraphIndex: Wrote 10147567 bytes in 11.921
| BTreeIndex: Wrote 5722256 bytes in 13.481
|
| Baseline overhead: Read 92994 keys in 7.531
| GraphIndex: Wrote 10147567 bytes in 11.870
| BTreeIndex: Wrote 4104336 bytes in 21.608
|
|
| The first set of figures is with index2 rev 7; the second with index2
| rev 8. So rev 8 is a big increase in write time over the current
| GraphIndex, and roughly a 60% increase over rev 7, for a 28% reduction
| in size vs rev 7 and a 60% reduction vs GraphIndex.
|
|
| Also, regarding zlib object copying:
|>>> import zlib
|>>> c = zlib.compressobj()
|>>> c.copy()
| <zlib.Compress object at 0x7f21b6ee2dd8>
|
| so my Python has it - if yours doesn't, then that backs up my theory
| that it would be a hassle to use it :). Though we could try to fall
| back...
|
| -Rob
|
|
|
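
Regarding the copy() fallback: here is a rough sketch of the
feature-detection approach I have in mind. It's purely illustrative -
the helper and flag names are mine, not anything in bzrlib:

import zlib

# Feature-detect Compress.copy(); it exists in Python >= 2.5, but not
# on every build, which is the hassle mentioned above.
try:
    zlib.compressobj().copy()
    _have_copy = True
except AttributeError:
    _have_copy = False


def snapshot_compressor(compressor, pending_lines):
    """Return a compressor we can trial-compress into safely.

    pending_lines is everything already fed to `compressor`; it is only
    needed on the slow path, where we rebuild equivalent state from
    scratch (equivalent for size estimation, though its output framing
    may differ from the original stream).
    """
    if _have_copy:
        return compressor.copy()
    fresh = zlib.compressobj()
    fresh.compress(''.join(pending_lines))
    return fresh

The slow path recompresses everything accumulated so far, which is
exactly why relying on a fallback is unattractive on builds that lack
copy().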


Summary: the baseline b-tree write on this machine is 6.6s. What you
merged is up at 13.3s. I can cut that down to 8.6s without much loss.
Ultimately, I felt it didn't matter a lot when the total time to branch
was 1min+. But I'm happy enough to include a couple of the tweaks. If
you give me feedback on which ones you consider "safe", I'll put
together a patch.


Baseline overhead: Read 105288 keys in 3.474
GraphIndex: Wrote 11182456 bytes in 4.701
rev 7: (what you had)
BTreeIndex: Wrote 6172672 bytes in 6.602

my rev 10: (what you merged)
BTreeIndex: Wrote 4399104 bytes in 13.282

rev 9: (special case before we get 'chunk_size' bytes)
BTreeIndex: Wrote 4399104 bytes in 12.371

(special case before we get '2*chunk_size' bytes):
BTreeIndex: Wrote 4399104 bytes in 11.736


So, I can shave a couple seconds off with some quick checks on how many
bytes we've read so far.

alt rev 10: count the number of repacking ops, if > 2, just stop
BTreeIndex: Wrote 4501504 bytes in 10.520

alt rev 10b: count repacks, stop at > 1
BTreeIndex: Wrote 4833280 bytes in 9.983

alt rev 9: special case < chunk_size, if n_repacks > 1, stop
BTreeIndex: Wrote 4722688 bytes in 9.156

alt rev 9b: special case < 2*chunk_size, if n_repacks > 1, stop
BTreeIndex: Wrote 4612096 bytes in 8.648

alt rev 9c: special case < 2*chunk_size, if n_repacks > 0, stop
BTreeIndex: Wrote 5369856 bytes in 7.787

So... if you trust that we will always get at least 2:1 compression (it
seems reasonable given the amount of entropy in our revision_ids), then
I can cut the time down quite a bit and retain *most* of the savings.

(9b) gives 4,612,096 bytes versus 6,172,672 without repacking, which is
1:1.34 instead of the maximum 1:1.4.

Note that this is also one of those places where you would probably like
"bzr pack" to set max_repacks = 100, and have it go to town.

Interestingly, if I set compression_level = 9 I get exactly the same
compression.
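
(That check is nothing bzr-specific - it's just the level argument to
compressobj. Roughly what I ran, with 'index-rows.txt' standing in for
a dump of the row text:)

import zlib

rows = open('index-rows.txt', 'rb').read()
for level in (6, 9):   # 6 is zlib's default compression level
    comp = zlib.compressobj(level)
    print level, len(comp.compress(rows) + comp.flush())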


John
=:->


