Better compression
raindog at macrohmasheen.com
Wed Jul 16 18:52:06 BST 2008
What level of perf increase is seen with this? Do you have any comparisons?
Sent from my Verizon Wireless BlackBerry
-----Original Message-----
From: Robert Collins <robertc at robertcollins.net>
Date: Wed, 16 Jul 2008 03:24:09
To: Bazaar <bazaar at lists.canonical.com>
Subject: Better compression
So, I've been pursuing a concept to get better compression in bzr. I now
have some figures to report.
Note that it's not tuned yet. (And perhaps we should use xdelta or
something with tuning.) It probably needs a C layer too; we could use
Python buffers in the decompressor to get zero-copy extraction, etc.
Executive summary
-----------------
I have a prototype repository format with a smaller size than git at
repack --window 200 --depth 200. This format currently has:
- ~ the same text extraction time we have today for single texts.
- no scatter-gather IO - any text needs precisely one index query
  and one read
- no accelerated left_matching_blocks
- memory-capped usage; regardless of repository size or the number of
  texts requested at once, the versioned file subsystem and index
  subsystem will not blow out on memory (but the size-of-a-single-file
  constraint still applies).
- formats compatible with plain, rich-roots and subtrees for testing
I think 1.7 is a feasible time to deliver a usable 'unqualified
improvement' format with this.
The size/performance tradeoff here is tunable - we can lower
compression and decompression time by building up smaller compression
groups.
The code is at (install all these plugins):
http://bazaar.launchpad.net/~lifeless/+junk/bzr-groupcompress
http://bazaar.launchpad.net/~lifeless/+junk/bzr-index2
http://bazaar.launchpad.net/~jameinel/+junk/bzr-pybloom
e.g.
mkdir -p ~/.bazaar/plugins
bzr branch http://bazaar.launchpad.net/~lifeless/+junk/bzr-groupcompress ~/.bazaar/plugins/groupcompress
bzr branch http://bazaar.launchpad.net/~lifeless/+junk/bzr-index2 ~/.bazaar/plugins/index2
bzr branch http://bazaar.launchpad.net/~jameinel/+junk/bzr-pybloom ~/.bazaar/plugins/pybloom
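(With those three branches in place, running 'bzr plugins' should list
groupcompress, index2 and pybloom - the plugin names follow the
directory names used above.)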
Design overview
---------------
The new compressor works by building up a single blob of output
containing many texts, each compressed against the preceding bytes in
the output. That blob is then entropy-coded (zlib at the moment). If
the total size of the output before entropy-coding exceeds some
threshold (currently 20 MB), the compression pass is completed and a
new compressor object is created.
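To make that concrete, here is a minimal sketch of the idea in Python 2
(as bzrlib is today). The names are made up and the delta step uses
difflib purely for illustration - the real compressor's matching code
is quite different and much faster:

import difflib
import zlib


def make_delta(base, new):
    """Encode 'new' as copy/insert instructions against 'base'.

    A crude stand-in for the real delta encoder, just to show the
    shape of the output: copy ranges from the preceding bytes where
    possible, insert literal bytes where not.
    """
    ops = []
    matcher = difflib.SequenceMatcher(None, base, new)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            ops.append('c,%d,%d\n' % (i1, i2 - i1))
        elif tag in ('insert', 'replace'):
            literal = new[j1:j2]
            ops.append('i,%d\n%s' % (len(literal), literal))
        # 'delete' regions of the base are simply never referenced.
    return ''.join(ops)


class GroupCompressorSketch(object):
    """Accumulate texts delta-compressed against the preceding output,
    then zlib the whole group once it passes a size threshold."""

    def __init__(self, max_group_bytes=20 * 1024 * 1024):
        self.max_group_bytes = max_group_bytes
        self.chunks = []      # the uncompressed output blob so far
        self.blob_len = 0

    def add_text(self, text):
        base = ''.join(self.chunks)      # all preceding output bytes
        delta = make_delta(base, text)
        self.chunks.append(delta)
        self.blob_len += len(delta)
        if self.blob_len >= self.max_group_bytes:
            return self.flush()          # compression pass complete
        return None

    def flush(self):
        # Entropy-code the accumulated group; the caller writes the
        # blob to a pack.  Extraction (not shown) decompresses the
        # group and replays the copy/insert instructions.
        if not self.chunks:
            return ''
        blob = zlib.compress(''.join(self.chunks))
        self.chunks = []
        self.blob_len = 0
        return blob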
This compressor is hooked into bzrlib by implementing a VersionedFiles
object type that uses the compressor to insert everything found during a
single insert_record_stream method call.
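Schematically (again with made-up names, reusing the sketch above, and
leaving out indexing, absent records and ordering) that glue looks
something like:

class GroupCompressVersionedFiles(object):
    """Sketch of the VersionedFiles glue: everything arriving in one
    insert_record_stream call goes through one compressor run."""

    def __init__(self, pack_writer):
        self._pack_writer = pack_writer

    def insert_record_stream(self, stream):
        compressor = GroupCompressorSketch()
        for record in stream:
            text = record.get_bytes_as('fulltext')
            blob = compressor.add_text(text)
            if blob is not None:      # threshold hit: group closed
                self._pack_writer.write(blob)
        tail = compressor.flush()     # close the final, partial group
        if tail:
            self._pack_writer.write(tail)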
The compressor works best when it starts with a text that contains most
of the strings that will be needed by successive texts so that they can
copy the strings in large ranges, rather than having to copy them in
many small ranges (e.g. texts grouped by fileid, then
most-recent-commit first). Given its worst case, it can output more
bytes to the packs than our current knit compression engine does.
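For example, the stream ordering could be arranged along these lines
(a hypothetical helper; 'commit_order' stands in for whatever
reverse-topological measure is actually used):

def group_compress_order(keys, commit_order):
    """Sort (file_id, revision_id) keys so texts are grouped by file,
    newest first within each file, before insert_record_stream sees
    them.

    commit_order maps revision ids to an integer that grows with newer
    commits - a stand-in for a real reverse-topological sort.
    """
    return sorted(keys, key=lambda key: (key[0], -commit_order[key[1]]))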
Some things that could be done if you want to help
--------------------------------------------------
- Integration of btree_index into bzrlib's core and the associated
  upgrade etc. logic (without the blooms and so on - just a lean, mean
  btree index facility) - get that through review. (I considered doing
  this for 1.6 but it would just delay the release.) This can be done
  now, as btree_index objects are IMO an unqualified improvement over
  GraphIndex. This would be a huge benefit to the new compressor
  project.
- C implementation of the compressor and extractor logic.
- Fetching-from-knits optimisation (a custom InterObject to fetch in
  reverse-topological order).
- Hammer on it and report issues.
- Progress bars!
- Integration of this as a patch to bzr.dev rather than as a plugin.
  (Depends on btree index making it in.)
- A rigorous comparison of this against bundles and the other
  previously studied compression schemes. I suspect that this (to
  quote Aaron) 'smokes' the other compressors we've looked at.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.