brisbane-core compression selection

Robert Collins robertc at robertcollins.net
Thu Feb 12 00:14:42 GMT 2009


I've been working recently on text compression, because knits have
several issues:
 - they are slower than we'd like
 - they do not compress as well as we'd like
 - they compress along ancestry lines, which do not exist in
brisbane-core's split inventory nodes.

I've been testing the size and performance of git, knits and
groupcompress using a little bench tool I wrote -
lp:~lifeless/+junk/bzr-compressbench. groupcompress and git (with repack
at -200 for both of the how-hard-to-try options) get approximately the
same size.

We have several broad options:
 - fix knit compression to compress against arbitrary parents and use it
   (probably with some minor long-outstanding serialisation fixes to be
   more efficient)
 - adapt a different implementation of the same style of compression,
   such as revlog or xdelta (with wrapping to fit packs)
 - use groupcompress or some other large-stream compressor (e.g. lzma)
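
As a rough illustration of why a large-stream compressor is attractive,
the sketch below compares compressing many similar texts one at a time
against compressing them as a single stream. It is not groupcompress and
not the bench tool; the corpus, make_versions, size_individually and
size_as_one_stream are all made up for the example, and only the Python
standard library (zlib, lzma) is assumed.

```python
import lzma
import zlib


def make_versions(count=50):
    """Build a hypothetical corpus: successive versions of one text.

    Each version is the previous one plus a single appended line, which
    loosely mimics a file history where most content is shared.
    """
    lines = ["line %d of the original file\n" % i for i in range(200)]
    versions = []
    for n in range(count):
        lines.append("change introduced in version %d\n" % n)
        versions.append("".join(lines).encode("utf-8"))
    return versions


def size_individually(texts):
    """Total size when every text is compressed on its own with zlib."""
    return sum(len(zlib.compress(t)) for t in texts)


def size_as_one_stream(texts, compress=lzma.compress):
    """Size when all texts are concatenated and compressed as one stream."""
    return len(compress(b"".join(texts)))


versions = make_versions()
individual = size_individually(versions)
grouped = size_as_one_stream(versions)
print("compressed one by one: %d bytes" % individual)
print("compressed as a stream: %d bytes" % grouped)
```

The single stream can reuse content shared between versions, which the
per-text case cannot. The cost is the one raised in this mail: getting
one text back out of, or stripping one text from, a large stream is
harder than with independently compressed texts.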


Before we can choose, we have some work remaining:
 - test with topologically influenced corpus ordering - putting the most
   recently referenced texts at the front of packs
 - establish a clear answer for 'strip unwanted texts out of a stream'
   (if we can't answer this, using groupcompress isn't an option)
 - establish a better answer for 'the work done by git's xdelta
   implementation' - the current python implementation in dulwich is
   _much_ slower than even forking out to git to get a text back.
 - establish how well split inventory hash nodes compress using either a
   byte sequence matcher, or changing the serialiser to allow line 
   matchers to match more content.
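
To make the last point concrete, here is a small sketch of the
difference between a line matcher and a byte sequence matcher. It uses
difflib.SequenceMatcher from the Python standard library as a stand-in
for both; the entry format and the helper names (make_node,
matched_by_line, matched_by_char) are invented for the example and are
not the brisbane-core serialiser or bzr's matcher.

```python
from difflib import SequenceMatcher


def make_node(revisions):
    """Serialise a hypothetical inventory node, one entry per line."""
    return "".join(
        "file-%02d id-%02d rev-%d\n" % (i, i, rev)
        for i, rev in enumerate(revisions)
    )


def matched_by_line(old, new):
    """Count characters of `old` covered when whole lines must match."""
    old_lines = old.splitlines(True)
    new_lines = new.splitlines(True)
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return sum(
        len(line)
        for block in matcher.get_matching_blocks()
        for line in old_lines[block.a:block.a + block.size]
    )


def matched_by_char(old, new):
    """Count characters of `old` covered when any substring may match."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


# Two versions of a node: entries 3 and 7 get a new revision.
old_revs = [1] * 20
new_revs = list(old_revs)
new_revs[3] = 2
new_revs[7] = 2
old_node = make_node(old_revs)
new_node = make_node(new_revs)

by_line = matched_by_line(old_node, new_node)
by_char = matched_by_char(old_node, new_node)
print("line matcher covers %d of %d characters" % (by_line, len(old_node)))
print("char matcher covers %d of %d characters" % (by_char, len(old_node)))
```

A line matcher discards a whole line when one field in it changes, so a
serialiser that packs many fields onto one line hurts it. That can be
addressed either by matching at a finer grain or by changing the
serialiser so that stable content sits on lines of its own.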


We could also look at using LZO or bz2 for the second pass on most of
these compressors, we could bench revlog, or we could look further
afield, but none of these necessarily helps with picking a tool today,
which is the main point. We need to:
 - have brisbane-core be _much_ faster than xml inventories for log of a
   file or log of a subdirectory
 - have brisbane-core be much smaller for initial clone - 7:1 sounds 
   nice to me ;)

-Rob