proof of concept: multiple-text compression

Mon Jul 7 04:45:35 BST 2008

I'm looking for a review of the design of a new compressor I put together
on the flight. If I have missed some consideration, I'd like to know
sooner rather than later.

lp:~lifeless/+junk/bzr-groupcompress

Design documentation:
http://bazaar.launchpad.net/~lifeless/%2Bjunk/bzr-groupcompress/annotate/3?file_id=design-20080705181503-ccbxd6xuy1bdnrpu-2

Anyhow, some quick figures so far, testing with all the versions of
NEWS:

                           bytes     size
conglomerate (catted)               420MB
gzip conglomerate      136433183  136.0MB
bzip conglomerate       36030207   36.0MB
knit hunks               4500832    4.5MB
    {'knit-delta-gz': 1487097, 'knit-ft-gz': 3013735}
7zip conglomerate         215892    0.2MB
git  200, 200             638227    0.6MB
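
For anyone wanting to reproduce the "conglomerate" style of measurement on
their own data, here is a minimal sketch. It is not the script used for the
figures above: the function name conglomerate_sizes and the generated sample
versions are hypothetical, and Python's zlib, bz2 and lzma modules stand in
for the gzip, bzip and 7zip command-line tools.

```python
import bz2
import lzma
import zlib


def conglomerate_sizes(versions):
    """Concatenate every version and return compressed sizes in bytes."""
    blob = b"".join(versions)
    return {
        "raw": len(blob),
        "zlib": len(zlib.compress(blob, 9)),
        "bz2": len(bz2.compress(blob, 9)),
        "lzma": len(lzma.compress(blob, preset=9)),
    }


# Hypothetical stand-in for successive versions of NEWS: each version
# is the previous one with one more entry appended.
versions = []
text = b""
for i in range(200):
    text += b"* Fixed bug %d in the frobnicator.\n" % i
    versions.append(text)

sizes = conglomerate_sizes(versions)
for name, size in sizes.items():
    print(f"{name:5s} {size:10d}")
```

Real history is far less regular than this generated sample, so the sizes
it prints say nothing about NEWS itself.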

group compress:
patience matcher (rev2)
16665951 raw, after zlib 599036 0.6MB
custom matcher (rev3) 
            after zlib 360181 0.4MB

Extraction is promising (0.5 seconds for the text I chose at the end, vs
0.4 seconds for bzr cat), but the cost of line splitting on 4MB of plain
text dominates the profile output: 64% is in cStringIO.readlines().

I'm going to do a binary-offset encoder next, which may inflate the size
a bit, but will remove the need for readlines in the decompressor.
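
To illustrate why byte offsets avoid the line-splitting cost, here is a
minimal sketch of a copy/insert delta decoder. This is a toy format invented
for the example, not the groupcompress encoding: the one-byte opcodes, the
fixed 4-byte big-endian fields, and the names copy_op, insert_op and
apply_delta are all assumptions.

```python
import struct


def copy_op(offset, length):
    """Encode 'copy length bytes from source starting at offset'."""
    return b"c" + struct.pack(">II", offset, length)


def insert_op(data):
    """Encode 'insert these literal bytes'."""
    return b"i" + struct.pack(">I", len(data)) + data


def apply_delta(source, delta):
    """Rebuild a text by slicing bytes directly; no readlines() needed."""
    out = []
    pos = 0
    while pos < len(delta):
        op = delta[pos:pos + 1]
        pos += 1
        if op == b"c":
            offset, length = struct.unpack_from(">II", delta, pos)
            pos += 8
            out.append(source[offset:offset + length])
        elif op == b"i":
            (length,) = struct.unpack_from(">I", delta, pos)
            pos += 4
            out.append(delta[pos:pos + length])
            pos += length
        else:
            raise ValueError("unknown opcode %r" % op)
    return b"".join(out)


source = b"line one\nline two\nline three\n"
# Keep the first and third lines, replace the second.
delta = copy_op(0, 9) + insert_op(b"line 2\n") + copy_op(18, 11)
result = apply_delta(source, delta)
print(result)
```

The fixed-width fields are where the size inflation would come from in a
format like this one; a variable-length integer encoding would reduce it at
the cost of a slightly more involved decoder.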

I don't know if this will be representative of all files; once I have a
VersionedFiles implementation passing the interface tests I'm going to do
a repo format and pull bzr etc into it.

But if it *is* representative:
4.5MB:0.4MB applied to bzr's history gives:
85MB:9MB or so.
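
A quick check of that extrapolation from the exact byte counts, as a sketch
only: it assumes the NEWS ratio scales linearly to the whole history, and
takes 85MB as the current knit size of bzr's history, as quoted above. On
those assumptions the straight ratio comes out a little under the rounded
figure.

```python
# Byte counts from the figures above.
knit_bytes = 1487097 + 3013735   # knit-delta-gz + knit-ft-gz
group_bytes = 360181             # custom matcher (rev3), after zlib

ratio = group_bytes / knit_bytes

# Assumption: bzr's history is about 85MB in knits, and the NEWS
# ratio carries over unchanged to every other file.
bzr_knit_mb = 85
projected_mb = bzr_knit_mb * ratio

print(f"ratio {ratio:.3f}, projected {projected_mb:.1f}MB")
```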

Which would be nice.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.