to group compress or not to group compress
Ian Clatworthy
ian.clatworthy at internode.on.net
Tue Jan 6 10:56:36 GMT 2009
Robert Collins wrote:
> We haven't set up metrics to assess possible solutions against - so in
> this mail I'm trying to capture the space and make sure there is
> nothing likely to bite us right after putting effort into choosing a
> solution.
Speed of common operations is top of the list of metrics. We also rightly
get assessed on total disk space usage, particularly for large projects,
so we should include it as well.
> Use cases for delta-compression:
> Large file storage:
> - we need some answer for storing large files like CD images. I am
> strongly inclined to say that we'll fragment such content and thus
> never have to deal with a single large text beyond some reasonable
> size in the delta compression logic.
I don't have any stats to back this claim up, but I feel that 99.9% of
large files are binary: images/photos, music clips, CD images, zips,
databases, etc. For these, fast annotate and diff don't matter, while
speed of extraction and storage efficiency grow in importance.
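
Agreed on fragmenting. To make the idea concrete, here's the sort of
thing I imagine - purely a sketch, with made-up names and an arbitrary
fragment size, not anything that exists in bzrlib today:

    # Split a large file into fixed-size fragments so the delta
    # compression logic never sees a text beyond some reasonable size.
    MAX_FRAGMENT = 4 * 1024 * 1024  # 4MB - arbitrary, for illustration

    def fragment(stream, max_fragment=MAX_FRAGMENT):
        """Yield successive byte fragments of at most max_fragment bytes."""
        while True:
            chunk = stream.read(max_fragment)
            if not chunk:
                break
            yield chunk

    # Extraction of a CD image is then just concatenating its fragments
    # in order; each fragment is compressed as an independent text.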
> We have, broadly speaking, 3 different compression strategies we have
> investigated:
> - single parent binary deltas [knits, xdelta]
> - application needs a single parent
> - deltas can be composed
> - merges lead to duplicate text storage in deltas
> - If the delta is along a history boundary, it may support annotation
> reuse.
> - individual deltas can be copied from pack to pack without
> regeneration
> - memory use is typically ~ 3 * text size at most for application
> and generation
> - multiple parent binary deltas [mpdiff, knits&xdelta with
> concatenated parent texts]
> - application requires N parents
> - deltas can be composed
> - merges do not lead to duplicate text in the deltas
> - individual deltas can be copied without regeneration
> - memory use is typically ~ (2 + parent-count) * text size for
> application and generation
You didn't mention (potential) annotation reuse under the multiple-parent
option. Was that implied?
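
Either way, the multi-parent property is the one that matters to me for
merge-heavy histories. A toy model of it - nothing like the real mpdiff
code, just the principle that a line present in any parent need not be
stored again:

    # Toy multi-parent delta: lines found in a parent become (parent,
    # line) references; only genuinely new lines are stored literally.
    def mp_delta(new_lines, parents):
        index = {}
        for p, plines in enumerate(parents):
            for i, line in enumerate(plines):
                index.setdefault(line, (p, i))
        return [('ref',) + index[line] if line in index else ('new', line)
                for line in new_lines]

    def mp_apply(delta, parents):
        # Application requires all N parent texts to be available.
        return [parents[op[1]][op[2]] if op[0] == 'ref' else op[1]
                for op in delta]

    base   = ['a\n', 'b\n']
    other  = ['a\n', 'c\n']
    merged = ['a\n', 'b\n', 'c\n', 'd\n']
    delta = mp_delta(merged, [base, other])
    assert mp_apply(delta, [base, other]) == merged
    # Only 'd\n' is stored literally - the merge duplicated nothing.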
> I think we have several choices we need to make.
>
> Firstly any compression parent choice that isn't solely along ancestry
> is unlikely to be able to aid annotation, so we will have a strong
> dependence on fast text extraction and insertion. So we need to either
> decide that we require annotation acceleration or that we will optimise
> for text extraction and allow annotation to depend on that.
>
> Secondly, what compressor should we use? We could settle on one, or
> perhaps allow several for different situations. Most important, though,
> is to remember that the driver for all of this is performance - we want
> to end up with a very fast system.
Conceptually, I like the thought of:
* whatever is fastest for inventory fragments
* multi-parent something for text files
* group compress for binary files
If text-vs-binary detection isn't practical for performance reasons,
the rule for applying group compress could be size based, e.g. any
file larger than 1MB, say.
It would be damn sweet if we could make it easy for developers to
create new formats using a different heuristic, e.g. always use
group compress for everything when the underlying application
is version management under a photo editor, say! (Clearly not the
main game, but using bzrlib as a developer toolkit is something I
expect to see more and more of.)
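
Something like this sketch is what I have in mind - every name here is
hypothetical, not current bzrlib API:

    # Hypothetical pluggable policy for picking a compression strategy.
    GROUP_COMPRESS_THRESHOLD = 1024 * 1024  # the 1MB rule above

    def default_policy(path, size, looks_binary=None):
        # Fall back to size alone when binary detection is too costly.
        if looks_binary or size > GROUP_COMPRESS_THRESHOLD:
            return 'groupcompress'
        return 'multiparent'

    _policy = [default_policy]

    def set_compression_policy(fn):
        """Let a format built on bzrlib substitute its own heuristic."""
        _policy[0] = fn

    def choose_compressor(path, size, looks_binary=None):
        return _policy[0](path, size, looks_binary)

    # The photo-editor case then becomes a one-liner:
    # set_compression_policy(
    #     lambda path, size, looks_binary=None: 'groupcompress')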
> I'm inclined at this point to write a benchmarker which can take a set
> of texts and a compression order, shove them in, and then extract them
> - similar to the index benchmarker we wrote.
I can't see how we can make the right choices without this tool.
Go for it, I say!
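
For what it's worth, this is roughly the harness shape I'd expect -
again just a sketch, assuming a compressor object with add/extract/
total_bytes methods rather than any existing interface:

    import time

    def benchmark(compressor, texts, order):
        """Insert texts in the given order, extract them all back, and
        report wall-clock times plus total compressed size."""
        start = time.time()
        for key in order:
            compressor.add(key, texts[key])
        insert_seconds = time.time() - start

        start = time.time()
        for key in order:
            # Round-trip check doubles as the extraction workload.
            assert compressor.extract(key) == texts[key]
        extract_seconds = time.time() - start

        return {'insert_seconds': insert_seconds,
                'extract_seconds': extract_seconds,
                'compressed_bytes': compressor.total_bytes()}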
Ian C.