to group compress or not to group compress

Robert Collins robertc at robertcollins.net
Tue Jan 6 04:01:01 GMT 2009


So we're rapidly closing in on having the split inventory work move from
a development stage into a stabilisation and tuning phase.

One of the remaining pieces is compression of the inventory; splitting
the inventory into components means that bzr can handle large trees
much more efficiently for operations like fetch and merge. However,
the fragments are not currently compressed at all, so we need new
logic for managing deltas; John did some experimentation in this space
last month.

Beyond inventory compression we can also improve the compression of file
texts, which account for a substantial amount of the remaining space in
our repositories.

We haven't set up metrics to assess possible solutions against - so in
this mail I'm trying to map out the space and make sure there is nothing
likely to bite us right after we've put effort into choosing a solution.

Use cases for delta-compression:
Repository pack storage:
 - Maximise extraction speed/minimise disk space
 - Optimise for local use
 - Be as robust as possible
 - No hard rules needed about where compression parents can be found

Network pack compression:
 - Balance network data transmitted with time to create and extract
 - Cannot refer to compression parents not already seen in a stream or
   held by the client

Annotation of texts:
 - Allow reuse of line delta information if it is held in the
   repository.
 - Requires extraction of many/every text in the ancestry of the
   one being annotated.

Large file storage:
 - We need some answer for storing large files like CD images. I am
   strongly inclined to say that we'll fragment such content and thus
   never have to deal with a single large text beyond some reasonable
   size in the delta compression logic (see the sketch just below).
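
To make that concrete, here is a minimal sketch of the fragmentation
idea; the 4MB cap and the helper names are hypothetical, not anything
bzrlib currently provides:

    MAX_FRAGMENT_SIZE = 4 * 1024 * 1024  # hypothetical cap

    def fragment(content, max_size=MAX_FRAGMENT_SIZE):
        """Yield successive bounded-size fragments of content."""
        for offset in range(0, len(content), max_size):
            yield content[offset:offset + max_size]

    def defragment(fragments):
        """Reassemble the original content from its fragments."""
        return b''.join(fragments)

With something like this, the delta compression logic only ever sees
texts of at most MAX_FRAGMENT_SIZE bytes, whatever the size of the
stored file.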


Broadly speaking, we have investigated 3 different compression
strategies:
 - single parent binary deltas [knits, xdelta]
   - application needs a single parent
   - deltas can be composed
   - merges lead to duplicate text storage in deltas
   - If the delta is along a history boundary it may support annotation
     reuse.
   - individual deltas can be copied from pack to pack without
     regeneration
   - memory use is typically ~ 3 * text size at most for application
     and generation
 - multiple parent binary deltas [mpdiff, knits&xdelta with
   concatenated parent texts]
   - application requires N parents
   - deltas can be composed
   - merges do not lead to duplicate text in the deltas
   - individual deltas can be copied without regeneration
   - memory use is typically ~ (2 + parent-count) * text size for
     application and generation
 - full text composite compression [weaves, group compress,
   one-big-lzma]
   - deltas are embedded in a single stream - no additional IO is needed
     for extraction, but deltas cannot be separately copied.
   - delta composition is either hard or impossible
   - merges have little or no impact on delta size
   - memory use is roughly twice the size of the unique content in all
     the texts in a single group for compression, and, depending on the
     compressor, up to the same for extraction.
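
To make the trade-offs concrete, here are toy sketches of the two
extremes; the formats and names are illustrative only, not bzrlib's
actual delta code. First, a knit-style single-parent line delta:

    from difflib import SequenceMatcher

    def make_delta(parent_lines, new_lines):
        """Describe new_lines as (start, end, replacement) hunks
        against parent_lines."""
        matcher = SequenceMatcher(None, parent_lines, new_lines)
        return [(i1, i2, new_lines[j1:j2])
                for op, i1, i2, j1, j2 in matcher.get_opcodes()
                if op != 'equal']

    def apply_delta(parent_lines, delta):
        """Rebuild the target text from its parent and a delta."""
        result, pos = [], 0
        for start, end, replacement in delta:
            result.extend(parent_lines[pos:start])  # unchanged region
            result.extend(replacement)              # changed lines
            pos = end
        result.extend(parent_lines[pos:])
        return result

Extraction walks a chain of such deltas back to a fulltext, which is
why chain length dominates extraction speed. Second, a composite store
in the one-big-lzma family (zlib here purely because it is in the
stdlib):

    import zlib

    class GroupStore:
        """Toy composite store: texts are concatenated into one group,
        compressed as a single stream, and extracted by offset."""

        def __init__(self):
            self._index = {}    # key -> (offset, length)
            self._plain = []    # uncompressed group being built
            self._size = 0

        def insert(self, key, content):
            self._index[key] = (self._size, len(content))
            self._plain.append(content)
            self._size += len(content)

        def finish(self):
            self._group = zlib.compress(b''.join(self._plain))
            self._plain = None  # group is now immutable

        def extract(self, key):
            offset, length = self._index[key]
            plain = zlib.decompress(self._group)  # whole group returns
            return plain[offset:offset + length]

Note that extract decompresses the whole group: no additional IO per
text, but individual deltas cannot be copied out separately - exactly
the trade-off listed above.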

We also have several different compression parent choices we've tested
or looked into, and that we can choose between for both inventory
components and texts when we get to that.

Left-most-ancestor: Our current strategy; it gives relatively poor
results.

Reverse-left-most-ancestor: Starting with the most recent text, this
improved the compression ratio when we tested it.

git's strategy: sort by object type, basename (or fileid, or some
combination) and size, and then test many alternatives.

multiparent-ancestry: Used by bundles, this compresses better than
left-most-ancestor.
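
As a rough illustration of the git-style choice, here is a hedged
sketch: sort texts so that same-named, similar-sized content is
adjacent, then try a small window of preceding candidates and keep
whichever yields the smallest delta. The window size and the cost
proxy are arbitrary assumptions, and make_delta is the toy from the
sketch above:

    def choose_parents(texts, window=10):
        """texts: {(basename, size, key): lines}.
        Returns {key: parent key, or None to store a fulltext}."""
        ordered = sorted(texts)   # similar texts become neighbours
        choices = {}
        for i, ident in enumerate(ordered):
            # Baseline cost: storing the fulltext, measured in lines.
            best, best_cost = None, len(texts[ident])
            for candidate in ordered[max(0, i - window):i]:
                delta = make_delta(texts[candidate], texts[ident])
                cost = sum(len(lines) + 1 for _, _, lines in delta)
                if cost < best_cost:
                    best, best_cost = candidate, cost
            choices[ident[2]] = best[2] if best is not None else None
        return choices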


I think we have several choices we need to make.

Firstly, any compression parent choice that isn't solely along ancestry
is unlikely to be able to aid annotation, so we would then have a strong
dependence on fast text extraction and insertion. We need to decide
either that we require annotation acceleration, or that we will optimise
for text extraction and allow annotation to depend on that.

Secondly, what compressor should we use? We could settle on one, or
perhaps allow several for different situations. Most importantly,
though, remember that the driver for all of this is performance - we
want to end up with a very fast system.

I'm inclined at this point to write a benchmarker which can take a set
of texts and a compression order, insert them all, and then extract
them - similar to the index benchmarker we wrote.
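
Something along these lines, where the store object is a stand-in for
whichever compressor is under test (not a real bzrlib class, and
size_in_bytes is a hypothetical accessor):

    import time

    def benchmark(store, texts, order):
        """texts: {key: content}; order: keys in insertion order."""
        start = time.time()
        for key in order:
            store.insert(key, texts[key])
        insert_seconds = time.time() - start

        start = time.time()
        for key in order:
            assert store.extract(key) == texts[key]  # round-trip check
        extract_seconds = time.time() - start

        return {'insert_seconds': insert_seconds,
                'extract_seconds': extract_seconds,
                'stored_bytes': store.size_in_bytes()}  # hypothetical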

-Rob



-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.