proof of concept: multiple-text compression
Robert Collins
robertc at robertcollins.net
Mon Jul 7 04:45:35 BST 2008
I'm looking for a review of the design of a new compressor I put together
on the flight. If I have missed some consideration, I'd like to know
sooner rather than later.
lp:~lifeless/+junk/bzr-groupcompress
Design documentation:
http://bazaar.launchpad.net/~lifeless/%2Bjunk/bzr-groupcompress/annotate/3?file_id=design-20080705181503-ccbxd6xuy1bdnrpu-2
Anyhow, some quick figures so far, testing with all the versions of
NEWS (sizes in bytes, then rounded):
conglomerate (catted)                         420MB
gzip conglomerate           136433183   136.0MB
bzip conglomerate            36030207    36.0MB
knit hunks                    4500832     4.5MB
  {'knit-delta-gz': 1487097, 'knit-ft-gz': 3013735}
7zip conglomerate              215892     0.2MB
git 200, 200                   638227     0.6MB
group compress:
  patience matcher (rev2)
    16665951 raw, after zlib   599036     0.6MB
  custom matcher (rev3)
    after zlib                 360181     0.4MB
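For anyone wanting to reproduce conglomerate-style figures on their own data, here is a minimal Python sketch. It is not the script used for the numbers in this mail: `compressed_sizes` and the sample `versions` list are hypothetical, and zlib stands in for gzip (the same deflate algorithm with slightly different framing).

```python
import bz2
import zlib


def compressed_sizes(texts):
    """Return raw, zlib and bz2 sizes of the concatenation of `texts`."""
    # "Conglomerate" here simply means every version catted together.
    conglomerate = b"".join(texts)
    return {
        "raw": len(conglomerate),
        "zlib": len(zlib.compress(conglomerate, 9)),
        "bz2": len(bz2.compress(conglomerate, 9)),
    }


# Hypothetical stand-in data: 50 versions of a file that grows by one
# line per version, so successive versions share most of their content.
versions = [b"".join(b"line %d\n" % i for i in range(n)) for n in range(1, 51)]
sizes = compressed_sizes(versions)

for name in ("raw", "zlib", "bz2"):
    print("%-5s %d" % (name, sizes[name]))
```

Real history would replace `versions` with the full texts of each revision of a file.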
Extraction is promising (0.5 seconds for the text I chose at the end, vs
0.4 seconds for bzr cat), but the cost of line splitting on 4MB of plain
text dominates the profile output: 64% is in cStringIO.readlines().
I'm going to do a binary-offset encoder next, which may inflate the size
a bit, but will remove the need for readlines in the decompressor.
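To illustrate why byte offsets avoid the line-splitting cost, here is a minimal Python sketch of a copy/insert decoder that works directly on a byte buffer. This is not the groupcompress format: the `apply_delta` name and the tuple layout of the instructions are assumptions made up for the example.

```python
def apply_delta(source, instructions):
    """Rebuild a text from copy/insert instructions over a byte buffer.

    Each instruction is either ("copy", offset, length), meaning bytes
    taken from `source`, or ("insert", data), meaning literal new bytes.
    This tuple layout is a hypothetical illustration only.
    """
    out = []
    for instruction in instructions:
        if instruction[0] == "copy":
            _, offset, length = instruction
            if offset < 0 or offset + length > len(source):
                raise ValueError("copy range outside source buffer")
            # A plain slice: no need to split the buffer into lines first.
            out.append(source[offset:offset + length])
        elif instruction[0] == "insert":
            out.append(instruction[1])
        else:
            raise ValueError("unknown instruction %r" % (instruction[0],))
    return b"".join(out)


group = b"alpha\nbeta\ngamma\n"
delta = [("copy", 0, 6), ("insert", b"BETA\n"), ("copy", 11, 6)]
rebuilt = apply_delta(group, delta)
print(rebuilt)  # b'alpha\nBETA\ngamma\n'
```

The trade-off matches the one described above: offsets and lengths take more room to encode than line numbers, but the decoder only slices.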
I don't know if this will be representative of all files; once I have a
VersionedFiles implementation passing the interface tests I'm going to do
a repo format and pull bzr etc. into it.
But if it *is* representative:
4.5MB:0.4MB applied to bzr's history gives:
85M:9M or so.
Which would be nice.
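The extrapolation is a straight linear scaling, sketched below in Python. The `scale_estimate` helper is hypothetical, and the calculation assumes the NEWS ratio holds for the whole history, which is exactly the open question. With the rounded figures it lands a little under the 9M quoted, in the same ballpark.

```python
def scale_estimate(current_size_mb, sample_before_mb, sample_after_mb):
    """Scale a repository size by the ratio observed on a sample file."""
    return current_size_mb * sample_after_mb / sample_before_mb


# 85MB of history, with the 4.5MB -> 0.4MB ratio seen on NEWS.
estimate = scale_estimate(85.0, 4.5, 0.4)
print("%.1fMB" % estimate)  # 7.6MB
```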
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.