brisbane-core compression selection
Robert Collins
robertc at robertcollins.net
Thu Feb 12 00:14:42 GMT 2009
I've been working recently on text compression, because knits have
several issues:
- they are slower than we'd like
- they do not compress as well as we'd like
- they compress along ancestry lines, which do not exist in
brisbane-core's split inventory nodes.
I've been testing the size and performance of git, knits and
groupcompress using a little bench tool I wrote -
lp:~lifeless/+junk/bzr-compressbench. groupcompress and git (with repack
at -200 for both of the how-hard-to-try options) get approximately the
same size.
We have several broad options:
- fix knit compression to compress against arbitrary parents and use it
(probably with some minor long-outstanding serialisation fixes to be
more efficient)
- adapt a different implementation of the same style of compression,
such as revlog or xdelta (with wrapping to fit packs)
- use groupcompress or some other large-stream compressor (e.g. lzma)
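To illustrate why the large-stream option is attractive, here is a rough sketch that compares compressing each text on its own against compressing many similar texts as one stream. It is not groupcompress and not the bench tool; lzma from the Python standard library stands in for "a large-stream compressor", and the synthetic corpus and the names make_versions, per_text_size and grouped_size are made up for the example.

```python
import lzma
import zlib


def make_versions(count=50, lines=200):
    """Build `count` successive versions of a text, each changing one line.

    Synthetic stand-in for a file's history; not real repository data.
    """
    base = ["line %d of the original file\n" % i for i in range(lines)]
    versions = []
    for rev in range(count):
        base[rev % lines] = "line changed in revision %d\n" % rev
        versions.append("".join(base).encode("utf-8"))
    return versions


def per_text_size(texts):
    """Total size when every text is compressed independently with zlib."""
    return sum(len(zlib.compress(t)) for t in texts)


def grouped_size(texts):
    """Size when all texts are concatenated into one lzma stream."""
    return len(lzma.compress(b"".join(texts)))


versions = make_versions()
independent = per_text_size(versions)
grouped = grouped_size(versions)
print("per-text zlib:   %d bytes" % independent)
print("one lzma stream: %d bytes" % grouped)
```

The single stream can reuse content across texts regardless of ancestry, which is the property that matters for split inventory nodes. The cost is that pulling one text back out means decoding the stream up to that text.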
In terms of choosing we have some work remaining:
- test with topologically influenced corpus ordering - putting the most
recently referenced texts at the front of packs
- establish a clear answer for 'strip unwanted texts out of a stream'
(if we can't answer this, using groupcompress isn't an option)
- establish a better answer for 'the work done by git's xdelta
implementation' - the current python implementation in dulwich is
_much_ slower than even forking out to git to get a text back.
- establish how well split inventory hash nodes compress using either a
byte sequence matcher, or changing the serialiser to allow line
matchers to match more content.
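On the last point, a small sketch of the difference between a line matcher and a byte sequence matcher. The node format here is hypothetical (all entries on one line) and is not brisbane-core's actual serialisation; difflib's SequenceMatcher stands in for both kinds of matcher, and the function names are invented for the example.

```python
from difflib import SequenceMatcher


def matched_bytes_by_line(old, new):
    """Count bytes of `old` covered by whole lines that also occur in `new`."""
    old_lines = old.splitlines(True)
    new_lines = new.splitlines(True)
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return sum(
        len(line)
        for block in matcher.get_matching_blocks()
        for line in old_lines[block.a:block.a + block.size]
    )


def matched_bytes_by_byte(old, new):
    """Count bytes of `old` covered by matching byte runs in `new`."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def make_node(changed=None):
    """Hypothetical one-line node: ten 'fileNN:shaNN' entries."""
    entries = []
    for i in range(10):
        sha = 99 if i == changed else i
        entries.append(b"file%02d:sha%02d" % (i, sha))
    return b"node: " + b" ".join(entries) + b"\n"


old_node = make_node()
new_node = make_node(changed=5)  # one entry differs
print("line matcher reuses %d bytes" % matched_bytes_by_line(old_node, new_node))
print("byte matcher reuses %d bytes" % matched_bytes_by_byte(old_node, new_node))
```

With everything on one line, a single changed entry leaves the line matcher with nothing to reuse, while the byte matcher reuses nearly all of the node. Changing the serialiser to put one entry per line would let a line matcher recover most of that.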
We could also look at using LZO or bz2 for the second pass on most of
these compressors; we could bench revlog; we could look further afield.
None of these necessarily helps with picking a tool today, though,
which is the main point. We need to:
- have brisbane-core be _much_ faster than xml inventories for log of a
file, log of a subdirectory
- have brisbane-core be much smaller for initial clone - 7:1 sounds
nice to me ;)
-Rob