Initial results using 'hash trie'
John Arbash Meinel
john at arbash-meinel.com
Thu Dec 25 17:48:16 GMT 2008
...
> Next I might try just bumping up the max leaf node size, and see what
> happens. It will probably cause the raw size to go up, but I would guess
> the compressed size wouldn't change much.
>
> John
> =:->
This seems pretty promising, in my "1k revs" test.
Commits: 1043

                     Raw        %    Compressed     %   Objects
Revisions:      3990 KiB      0%      1228 KiB     7%      1043
Inventories:   46174 KiB      4%      3425 KiB    21%     11539
Texts:        882272 KiB     94%     11174 KiB    70%      7226
Signatures:        0 KiB      0%         0 KiB     0%         0
Total:        932437 KiB    100%     15828 KiB   100%     19808

Extra Info:           count    total     avg   stddev   min   max
internal node refs     4253    59712    14.0      3.4     8    16
internal p_id refs      236     2279     9.7      2.4     8    14
inv depth              5747    17239     3.0      0.0     1     3
leaf node items        5747   123550    21.5      4.5     4    39
leaf p_id items         260    56773   218.4    147.3     1   516
p_id depth              260      619     2.4      0.5     1     3
The compressed size has gotten *better*, and as expected the 'raw' size
went up (46k instead of 30k).
Average depth for a 16-way fan out is 3, which seems good. (The previous
16-way fan out was closer to 4, which is also expected, since a 64k page
holds 16 times more than a 4k page.)
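As a rough sanity check on that depth (my own back-of-the-envelope
calculation, using only the fan out and the 'leaf node items' average
from the table above), a depth-3 trie with 16-way internal nodes
addresses on the order of a few thousand entries, and a 64k leaf holds
16 times what a 4k leaf does, which is exactly one level of 16-way fan out:

FAN_OUT = 16           # 16-way internal nodes
AVG_LEAF_ITEMS = 21.5  # 'leaf node items' average from the stats above

def approx_capacity(depth):
    # depth counts the leaf level: depth 1 is a single leaf node, and
    # each additional level multiplies the number of leaves by FAN_OUT.
    return FAN_OUT ** (depth - 1) * AVG_LEAF_ITEMS

print(approx_capacity(3))        # ~5500 entries at depth 3
print(approx_capacity(4))        # ~88000 entries at depth 4
print(64 * 1024 // (4 * 1024))   # 16: a 64k page absorbs one 16-way level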
Looking at the index info, the .cix is now about 2 times the size of
the .tix, which is favorable compared with the previous form's 5x. The
number of inventory objects per commit is around 11, versus 12 with the
255-way fan out, so not really better there. I'm not sure how to
explain that versus the improvement in the .cix. (It may just be too
early for the .cix to be measured properly.)
I should also mention that all of my 'delta' testing has been done with
a modified inventory format, which uses '\n' as the separator rather
than '\0' and puts the 'likely-to-change' bits together. That should
allow a 'knit delta' to be close to optimal, since it can just include
the 3 lines that change with every commit (revision, sha, size).
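To make that concrete, here is a rough sketch of the kind of layout I
mean (the field names and ordering are illustrative, not the exact
serializer): with one field per line and the volatile fields grouped at
the end, a line-based delta for a modified file only needs to replace
those last few lines.

import difflib

def serialize_entry(entry):
    """Hypothetical line-oriented layout for one inventory entry.

    This is not the exact bzrlib serializer, just the shape of it:
    stable fields (file_id, name, parent_id, kind) come first, and the
    fields that change on nearly every commit (revision, sha1, size)
    are grouped at the end, one per line.
    """
    lines = [
        entry['file_id'],
        entry['name'],
        entry['parent_id'],
        entry['kind'],
        # volatile fields last, so a line-based delta touches only these
        entry['revision'],
        entry['sha1'],
        str(entry['size']),
    ]
    return '\n'.join(lines) + '\n'

old = serialize_entry({'file_id': 'foo-id', 'name': 'foo.c',
                       'parent_id': 'root-id', 'kind': 'file',
                       'revision': 'rev-41', 'sha1': 'aaaa', 'size': 100})
new = serialize_entry({'file_id': 'foo-id', 'name': 'foo.c',
                       'parent_id': 'root-id', 'kind': 'file',
                       'revision': 'rev-42', 'sha1': 'bbbb', 'size': 104})

# Only the three trailing lines differ, so a knit-style line delta for
# this entry is just those three replacement lines.
for line in difflib.unified_diff(old.splitlines(), new.splitlines(),
                                 lineterm='', n=0):
    print(line)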
It would also be interesting to try other delta compression techniques.
For instance, zlib across all the hunks in a given revision will
probably do better, because the repeated 'revision' attribute can be
combined.
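As a quick illustration of why (a toy example of my own, not a
measurement of the real format): when every hunk repeats the same
revision id, compressing the hunks together lets zlib share that
string, while per-hunk compression pays for it every time.

import zlib

# Toy delta hunks: each modified entry repeats the same revision id.
rev_id = 'john@arbash-meinel.com-20081225-abcdef0123456789'
hunks = [('file-%04d\n%s\nsha-%04d\n%d\n' % (i, rev_id, i, 100 + i)).encode()
         for i in range(50)]

separate = sum(len(zlib.compress(h)) for h in hunks)
combined = len(zlib.compress(b''.join(hunks)))

print('compressed separately:', separate)
print('compressed together:  ', combined)
# The combined stream is much smaller, largely because the shared
# revision id only has to be represented once in the zlib window.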
Without extracting a base text, I don't see GroupCompress doing very
well during an incremental conversion, though it is certainly possible
that "bzr pack" could be optimized for GC, and then I would guess it
could do quite well. Reverse-topological ordering should be quite good
for inventories: the latest one is the sum of all merges, and since
each earlier one 'undoes' a single change, all of the previous
inventories should have a lot to share.
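A minimal sketch of that ordering idea, assuming this is just an
illustration of newest-first delta chains and not bzrlib's actual
GroupCompress code: store the newest inventory as a fulltext and each
older one as a delta against its predecessor in the stream, so each
step only has to 'undo' a small change.

import difflib
import zlib

def newest_first_stream(texts_newest_first):
    """Illustrative newest-first delta stream (not the real GroupCompress).

    The first (newest) text is stored in full; each older text is then
    stored as a unified diff against its predecessor in the stream.
    """
    out = [('fulltext', texts_newest_first[0])]
    for prev, cur in zip(texts_newest_first, texts_newest_first[1:]):
        delta = '\n'.join(difflib.unified_diff(
            prev.splitlines(), cur.splitlines(), lineterm='', n=0))
        out.append(('delta', delta))
    return out

# Toy inventories, one entry modified per version, listed newest first.
inventories = []
for j in range(6):
    entries = ['entry-%03d rev-1' % i for i in range(200)]
    entries[j] = 'entry-%03d rev-99' % j
    inventories.append('\n'.join(entries))
inventories.reverse()  # newest first

stream = newest_first_stream(inventories)
raw = sum(len(text) for _, text in stream)
packed = len(zlib.compress('\x00'.join(text for _, text in stream).encode()))
print('1 fulltext + %d deltas, %d raw bytes, %d after zlib'
      % (len(stream) - 1, raw, packed))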
I certainly expect it could be a lot better than "one direct parent"
compression, though I would guess you see far more "modification" than
actual "removal" of lines, so reverse deltas aren't going to be much
smaller than forward deltas.
John
=:->