Initial results using 'hash trie'
John Arbash Meinel
john at arbash-meinel.com
Thu Dec 25 17:48:16 GMT 2008
...
> Next I might try just bumping up the max leaf node size, and see what
> happens. It will probably cause the raw size to go up, but I would guess
> the compressed size wouldn't change much.
>
> John
> =:->
This seems pretty promising, in my "1k revs" test.
Commits: 1043

                     Raw        %    Compressed     %   Objects
Revisions:      3990 KiB      0%      1228 KiB     7%      1043
Inventories:   46174 KiB      4%      3425 KiB    21%     11539
Texts:        882272 KiB     94%     11174 KiB    70%      7226
Signatures:        0 KiB      0%         0 KiB     0%         0
Total:        932437 KiB    100%     15828 KiB   100%     19808

Extra Info:           count    total     avg   stddev   min   max
internal node refs     4253    59712    14.0      3.4     8    16
internal p_id refs      236     2279     9.7      2.4     8    14
inv depth              5747    17239     3.0      0.0     1     3
leaf node items        5747   123550    21.5      4.5     4    39
leaf p_id items         260    56773   218.4    147.3     1   516
p_id depth              260      619     2.4      0.5     1     3
The compressed size has gotten *better*, and as expected the 'raw' size
went up (46k instead of 30k).
Average depth for a 16-way fan out is 3, which seems good. (The previous
16-way fan out was closer to 4, which is also expected, since a 64k page
holds 16 times more than a 4k page.)
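As a rough sanity check on that depth (my own back-of-the-envelope
calculation, using only the fan out and the 'leaf node items' average
from the table above), a depth-3 trie with 16-way internal nodes
addresses on the order of a few thousand entries, and a 64k leaf holds
16 times what a 4k leaf does, which is exactly one level of 16-way fan out:

FAN_OUT = 16           # 16-way internal nodes
AVG_LEAF_ITEMS = 21.5  # 'leaf node items' average from the stats above

def approx_capacity(depth):
    # depth counts the leaf level: depth 1 is a single leaf node, and
    # each additional level multiplies the number of leaves by FAN_OUT.
    return FAN_OUT ** (depth - 1) * AVG_LEAF_ITEMS

print(approx_capacity(3))        # ~5500 entries at depth 3
print(approx_capacity(4))        # ~88000 entries at depth 4
print(64 * 1024 // (4 * 1024))   # 16: a 64k page absorbs one 16-way level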
Looking at the index info, the .cix is now about 2 times the size of
the .tix, which is favorable compared with the previous form's 5x. The
number of inventory objects per commit is around 11, versus 12 with the
255-way fan out, so not really better there. I'm not sure how to
explain that versus the improvement in the .cix. (It may just be too
early for the .cix to be measured properly.)
I should also mention that all of my 'delta' testing has been done with
a modified inventory format, which uses '\n' as the separator rather
than '\0' and puts the 'likely-to-change' bits together. That should
allow a 'knit delta' to be close to optimal, since it can just include
the 3 lines that change with every commit (revision, sha, size).
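To make that concrete, here is a rough sketch of the kind of layout I
mean (the field names and ordering are illustrative, not the exact
serializer): with one field per line and the volatile fields grouped at
the end, a line-based delta for a modified file only needs to replace
those last few lines.

import difflib

def serialize_entry(entry):
    """Hypothetical line-oriented layout for one inventory entry.

    This is not the exact bzrlib serializer, just the shape of it:
    stable fields (file_id, name, parent_id, kind) come first, and the
    fields that change on nearly every commit (revision, sha1, size)
    are grouped at the end, one per line.
    """
    lines = [
        entry['file_id'],
        entry['name'],
        entry['parent_id'],
        entry['kind'],
        # volatile fields last, so a line-based delta touches only these
        entry['revision'],
        entry['sha1'],
        str(entry['size']),
    ]
    return '\n'.join(lines) + '\n'

old = serialize_entry({'file_id': 'foo-id', 'name': 'foo.c',
                       'parent_id': 'root-id', 'kind': 'file',
                       'revision': 'rev-41', 'sha1': 'aaaa', 'size': 100})
new = serialize_entry({'file_id': 'foo-id', 'name': 'foo.c',
                       'parent_id': 'root-id', 'kind': 'file',
                       'revision': 'rev-42', 'sha1': 'bbbb', 'size': 104})

# Only the three trailing lines differ, so a knit-style line delta for
# this entry is just those three replacement lines.
for line in difflib.unified_diff(old.splitlines(), new.splitlines(),
                                 lineterm='', n=0):
    print(line)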
It would also be interesting to try other delta compression techniques.
For instance, zlib across all the hunks in a given revision will
probably do better, because the repeated 'revision' attribute can be
combined.
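As a quick illustration of why (a toy example of my own, not a
measurement of the real format): when every hunk repeats the same
revision id, compressing the hunks together lets zlib share that
string, while per-hunk compression pays for it every time.

import zlib

# Toy delta hunks: each modified entry repeats the same revision id.
rev_id = 'john@arbash-meinel.com-20081225-abcdef0123456789'
hunks = [('file-%04d\n%s\nsha-%04d\n%d\n' % (i, rev_id, i, 100 + i)).encode()
         for i in range(50)]

separate = sum(len(zlib.compress(h)) for h in hunks)
combined = len(zlib.compress(b''.join(hunks)))

print('compressed separately:', separate)
print('compressed together:  ', combined)
# The combined stream is much smaller, largely because the shared
# revision id only has to be represented once in the zlib window.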
Without extracting a base text, I don't see GroupCompress doing very
well during an incremental conversion, though it is certainly possible
that "bzr pack" could be optimized for GC, and then I would guess it
could do quite well. Reverse-topological ordering should be quite good
for inventories: the latest one is the sum of all merges, and since
each earlier one 'undoes' a single change, all of the previous
inventories should have a lot to share.
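A minimal sketch of that ordering idea, assuming this is just an
illustration of newest-first delta chains and not bzrlib's actual
GroupCompress code: store the newest inventory as a fulltext and each
older one as a delta against its predecessor in the stream, so each
step only has to 'undo' a small change.

import difflib
import zlib

def newest_first_stream(texts_newest_first):
    """Illustrative newest-first delta stream (not the real GroupCompress).

    The first (newest) text is stored in full; each older text is then
    stored as a unified diff against its predecessor in the stream.
    """
    out = [('fulltext', texts_newest_first[0])]
    for prev, cur in zip(texts_newest_first, texts_newest_first[1:]):
        delta = '\n'.join(difflib.unified_diff(
            prev.splitlines(), cur.splitlines(), lineterm='', n=0))
        out.append(('delta', delta))
    return out

# Toy inventories, one entry modified per version, listed newest first.
inventories = []
for j in range(6):
    entries = ['entry-%03d rev-1' % i for i in range(200)]
    entries[j] = 'entry-%03d rev-99' % j
    inventories.append('\n'.join(entries))
inventories.reverse()  # newest first

stream = newest_first_stream(inventories)
raw = sum(len(text) for _, text in stream)
packed = len(zlib.compress('\x00'.join(text for _, text in stream).encode()))
print('1 fulltext + %d deltas, %d raw bytes, %d after zlib'
      % (len(stream) - 1, raw, packed))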
I certainly expect it could be a lot better than "one direct parent"
compression, though I would guess you see far more "modification" than
actual "removal" of lines, so reverse deltas aren't going to be much
smaller than forward deltas.
John
=:->