groupcompress+chk some more permutations

Fri Mar 6 05:29:58 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> Labels/No Labels
> 
> 1) I've implemented the code to move the label/sha1/etc block to the
> start of the group, and run a separate compression over that text.
> 
> 2) The really nice part of this is that it makes the data fully
> self-describing, you can get a compressed groupcompress blob, and be
> able to read everything out of it. (The index is fully redundant.)
> 
> 3) The downside is that it adds a fairly large overhead to the stored
> data. Somewhere between 30-50 bytes per record. This isn't a big deal
> for Texts, where the average record size is approx 1kB. However for
> Inventory pages, the average compressed size is about 100 bytes. So
> adding 30 bytes overhead is 30% bloat. Even more significant, the chk
> pages are addressed by a sha1, which means that they are already
> self-describing.
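
To put the overhead in point 3 in perspective, here is a quick
back-of-the-envelope sketch. The 30 and 50 byte figures and the average
record sizes are the rough ones quoted above; treating 1kB as 1024
bytes and the helper name label_overhead_pct are my own assumptions.

```python
def label_overhead_pct(label_bytes, avg_record_bytes):
    """Return the label overhead as a percentage of the average record size."""
    return 100.0 * label_bytes / avg_record_bytes


# Rough averages from the discussion: ~1kB per Text, ~100 bytes per
# compressed Inventory page.
for kind, avg_size in (("Texts", 1024), ("Inventory pages", 100)):
    low = label_overhead_pct(30, avg_size)
    high = label_overhead_pct(50, avg_size)
    print("%-16s %5.1f%% to %5.1f%% overhead" % (kind, low, high))
```

The same 30 bytes that are lost in the noise for Texts are a large
fraction of a typical inventory page.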

This time I tried a bigger conversion: the python tree (which has fewer
total files, and thus a shallower tree). Note that python is a bzr-svn
conversion, so all file-ids and revision ids are 'trivially'
compressible (svn-foo:UU_I_D-1, svn-foo:UU_I_D-2, ...).
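
As a rough illustration of why sequential ids like these compress so
well, the sketch below compares them against ids with a random suffix.
It uses plain zlib rather than groupcompress, and both id formats
(including the joe@example.com prefix) are made up for the example, so
only the relative difference is meaningful.

```python
import random
import zlib


def compressed_ratio(ids):
    """Return compressed size / raw size for a newline-joined list of ids."""
    raw = "\n".join(ids).encode("ascii")
    return len(zlib.compress(raw)) / float(len(raw))


# Hypothetical bzr-svn style ids: a fixed prefix plus a counter.
sequential_ids = ["svn-foo:UU_I_D-%d" % i for i in range(1, 10001)]

# Hypothetical ids with a random 96-bit hex suffix (seeded for repeatability).
rng = random.Random(0)
random_ids = ["joe@example.com-%024x" % rng.getrandbits(96)
              for _ in range(10000)]

print("sequential ids: %.2f" % compressed_ratio(sequential_ids))
print("random ids:     %.2f" % compressed_ratio(random_ids))
```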

The effect of labels is quite pronounced:

$ time wbzr repository-details python-gc-no-labels
Commits: 40821
                      Raw    %    Compressed    %  Objects
Revisions:      20113 KiB   0%      4361 KiB   5%    40821
Inventories:   916063 KiB  25%     18256 KiB  25%   401283
Texts:        2708390 KiB  74%     50320 KiB  68%   208741
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        3644567 KiB 100%     72939 KiB 100%   650845

Extra Info:           count    total    avg stddev  min  max
internal node refs   134980 11714502   86.8  109.5    9  255
internal p_id refs     6210   595468   95.9   62.2    2  205
inv depth            212570   525137    2.5    0.5    1    3
leaf node items      212570   891514    4.2    3.4    1   11
leaf p_id items        6702    47323    7.1    9.5    1   45
p_id depth             6702    19083    2.8    0.7    1    4
real    7m40.568s

$ time wbzr repository-details python-gc-labels
Commits: 40821
                      Raw    %    Compressed    %  Objects
Revisions:      20113 KiB   0%      5816 KiB   6%    40821
Inventories:   916063 KiB  25%     32405 KiB  34%   401283
Texts:        2708390 KiB  74%     54516 KiB  58%   208741
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        3644567 KiB 100%     92739 KiB 100%   650845

Extra Info:           count    total    avg stddev  min  max
internal node refs   134980 11714502   86.8  109.5    9  255
internal p_id refs     6210   595468   95.9   62.2    2  205
inv depth            212570   525137    2.5    0.5    1    3
leaf node items      212570   891514    4.2    3.4    1   11
leaf p_id items        6702    47323    7.1    9.5    1   45
p_id depth             6702    19083    2.8    0.7    1    4
real    9m26.691s

So adding labels gives a 77% increase in the compressed size of
Inventories (almost double), ~8.3% for texts, and 33% for revisions
(though I imagine that keeping the label next to the revision would
actually be better than splitting it out, since revisions reference
their own revision id).
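
For anyone who wants to check the arithmetic, this small sketch
recomputes the increases from the "Compressed" column of the two
repository-details runs above. The dictionary layout and the
pct_increase helper are just for illustration.

```python
# Compressed sizes in KiB, copied from the two repository-details runs.
no_labels = {"Revisions": 4361, "Inventories": 18256,
             "Texts": 50320, "Total": 72939}
with_labels = {"Revisions": 5816, "Inventories": 32405,
               "Texts": 54516, "Total": 92739}


def pct_increase(before, after):
    """Return the percentage growth going from before to after."""
    return 100.0 * (after - before) / before


for kind in ("Revisions", "Inventories", "Texts", "Total"):
    growth = pct_increase(no_labels[kind], with_labels[kind])
    print("%-12s %6.1f%%" % (kind, growth))
```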

$ du -ksh python-gc-no-labels python-gc-labels/
88M     python-gc-no-labels
107M    python-gc-labels/

Now, the total effect would be lower with 'bigpage' since there would be
fewer total chk pages.

But I believe this pushes me further towards not having labels in chk
pages (we can leave them in other pages, but chk is inherently
'self-describing').
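
A minimal sketch of what 'self-describing' means for content-addressed
pages: the key can always be recomputed from the bytes, so storing it
again as a label adds nothing. The function names and the "sha1:" key
prefix here are assumptions for illustration, not the actual chk code.

```python
import hashlib


def chk_key(page_bytes):
    """Hypothetical helper: derive a page's key from its own content."""
    return "sha1:" + hashlib.sha1(page_bytes).hexdigest()


def verify_page(key, page_bytes):
    """Check that the bytes really belong to the given key."""
    return chk_key(page_bytes) == key


page = b"abc"
key = chk_key(page)
print(key)
print(verify_page(key, page))
```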

Interestingly, if the 30% reduction using lzma held true for the whole
python branch, we would end up at 88*.7=61MB, which is actually smaller
than the size of a python 2.6 checkout (64MB).
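
The estimate above, spelled out as a sketch. It simply assumes the 30%
lzma saving applies uniformly to the whole 88MB no-labels repository,
which is an extrapolation rather than a measurement.

```python
repo_size_mb = 88        # du output for python-gc-no-labels
lzma_reduction = 0.30    # assumed to hold for the whole repository
checkout_size_mb = 64    # size of a python 2.6 checkout

estimated_mb = repo_size_mb * (1 - lzma_reduction)
print("estimated repository size: %.1f MB" % estimated_mb)
print("smaller than a checkout: %s" % (estimated_mb < checkout_size_mb))
```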

So at least in theory "time bzr branch code.py.org/python/trunk" would
be just about as fast as a plain checkout from subversion. (I don't know
if subversion compresses the fulltexts it sends, but it can't send
deltas since you don't have a checkout yet.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmwtNYACgkQJdeBCYSNAAMN0gCgvRoF2hxWjQJjYzHlVDm+lvQ4
kC8AoJo686ivT0a/f/Mq+gomHAEYKuUk
=staD
-----END PGP SIGNATURE-----