groupcompress+chk some more permutations
John Arbash Meinel
john at arbash-meinel.com
Fri Mar 6 05:29:58 GMT 2009
...
> Labels/No Labels
>
> 1) I've implemented the code to move the label/sha1/etc block to the
> start of the group, and run a separate compression over that text.
>
> 2) The really nice part of this is that it makes the data fully
> self-describing, you can get a compressed groupcompress blob, and be
> able to read everything out of it. (The index is fully redundant.)
>
> 3) The downside is that it adds a fairly large overhead to the stored
> data. Somewhere between 30-50 bytes per record. This isn't a big deal
> for Texts, where the average record size is approx 1kB. However for
> Inventory pages, the average compressed size is about 100 bytes. So
> adding 30 bytes overhead is 30% bloat. Even more significant, the chk
> pages are addressed by a sha1, which means that they are already
> self-describing.
Here are numbers from a bigger conversion, the python tree (which has
fewer total files, and thus a shallower tree). Also, python is a bzr-svn
conversion, so all file-ids and revision ids are 'trivially' compressible
(svn-foo:UU_I_D-1, svn-foo:UU_I_D-2, ...).
The effect of labels is quite pronounced:
$ time wbzr repository-details python-gc-no-labels
Commits: 40821
                     Raw    %  Compressed    %  Objects
Revisions:     20113 KiB   0%    4361 KiB   5%    40821
Inventories:  916063 KiB  25%   18256 KiB  25%   401283
Texts:       2708390 KiB  74%   50320 KiB  68%   208741
Signatures:        0 KiB   0%       0 KiB   0%        0
Total:       3644567 KiB 100%   72939 KiB 100%   650845

Extra Info:          count     total   avg stddev  min  max
internal node refs  134980  11714502  86.8  109.5    9  255
internal p_id refs    6210    595468  95.9   62.2    2  205
inv depth           212570    525137   2.5    0.5    1    3
leaf node items     212570    891514   4.2    3.4    1   11
leaf p_id items       6702     47323   7.1    9.5    1   45
p_id depth            6702     19083   2.8    0.7    1    4
real 7m40.568s
$ time wbzr repository-details python-gc-labels
Commits: 40821
                     Raw    %  Compressed    %  Objects
Revisions:     20113 KiB   0%    5816 KiB   6%    40821
Inventories:  916063 KiB  25%   32405 KiB  34%   401283
Texts:       2708390 KiB  74%   54516 KiB  58%   208741
Signatures:        0 KiB   0%       0 KiB   0%        0
Total:       3644567 KiB 100%   92739 KiB 100%   650845

Extra Info:          count     total   avg stddev  min  max
internal node refs  134980  11714502  86.8  109.5    9  255
internal p_id refs    6210    595468  95.9   62.2    2  205
inv depth           212570    525137   2.5    0.5    1    3
leaf node items     212570    891514   4.2    3.4    1   11
leaf p_id items       6702     47323   7.1    9.5    1   45
p_id depth            6702     19083   2.8    0.7    1    4
real 9m26.691s
So that is a 77% increase in the compressed size of Inventories when
adding labels (almost double), ~8.3% for Texts, and 33% for Revisions
(though I imagine that storing the label next to the revision would
actually be better than splitting it out, since revisions reference
their own revision id).
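As a quick sanity check, those percentages can be recomputed from the
Compressed columns of the two runs. This is only a sketch: the dict names
and the pct_increase helper are made up for illustration and are not part
of bzr.

```python
# Compressed sizes in KiB, copied from the two repository-details runs.
no_labels = {"Revisions": 4361, "Inventories": 18256, "Texts": 50320, "Total": 72939}
labels = {"Revisions": 5816, "Inventories": 32405, "Texts": 54516, "Total": 92739}


def pct_increase(before, after):
    """Return the growth from before to after as a percentage."""
    return (after - before) * 100.0 / before


# Growth per record type when labels are added.
growth = {kind: pct_increase(no_labels[kind], labels[kind]) for kind in no_labels}

for kind, pct in growth.items():
    print("%-12s %5.1f%%" % (kind + ":", pct))
```

The same calculation on the Total row comes out at roughly 27% for the
repository as a whole.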
$ du -ksh python-gc-no-labels python-gc-labels/
88M python-gc-no-labels
107M python-gc-labels/
Now, the total effect would be lower with 'bigpage' since there would be
fewer total chk pages.
But I believe this pushes me further towards not having labels in chk
pages. (We can leave them in other pages, but chk is inherently
'self-describing'.)
Interestingly, if the 30% reduction using lzma held true for the whole
python branch, we would end up at 88*.7=61MB, which is actually smaller
than the size of a python 2.6 checkout (64MB).
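The estimate as a small sketch. It assumes the 30% lzma reduction applies
uniformly to the 88M reported by du for the no-labels repository; the
variable names are only for illustration.

```python
repo_mb = 88            # du size of python-gc-no-labels, in MB
lzma_reduction = 0.30   # assumption: the 30% lzma saving holds for the whole branch
checkout_mb = 64        # size of a python 2.6 checkout, in MB

# Projected repository size if lzma saved 30% everywhere.
estimated_mb = repo_mb * (1 - lzma_reduction)

print("estimated %.1f MB vs checkout %d MB" % (estimated_mb, checkout_mb))
```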
So at least in theory "time bzr branch code.py.org/python/trunk" would
be just about as fast as a plain checkout from subversion. (I don't know
if subversion compresses the fulltexts it sends, but it can't send deltas
since you haven't gotten a checkout yet.)
John
=:->