to group compress or not to group compress

John Arbash Meinel john at arbash-meinel.com
Tue Jan 6 22:04:35 GMT 2009


Ian Clatworthy wrote:
> Robert Collins wrote:
> 
>> We haven't set up metrics to assess possible solutions on - so in this
>> mail I'm trying to capture the space and make sure there are no things
>> likely to bite us right after putting effort into choosing a solution.
> 
> Speed of common operations is top of the list of metrics. We also rightly
> get assessed on total disk space usage, particularly for large projects,
> so we should include it as well.
> 

I agree on both points. I don't really have a good idea of the relative
weight.

...


>> We have, broadly speaking, 3 different compression strategies we have
>> investigated:
>>  - single parent binary deltas [knits, xdelta]
>>    - application needs a single parent
>>    - deltas can be composed
>>    - merges lead to duplicate text storage in deltas
>>    - If the delta is along a history boundary may support annotation
>>      reuse.
>>    - individual deltas can be copied from pack to pack without
>>      regeneration
>>    - memory use is typically ~ 3 * text size at most for application
>>      and generation
>>  - multiple parent binary deltas [mpdiff, knits&xdelta with
>>                                    concatenated parent texts]
>>    - application requires N parents
>>    - deltas can be composed
>>    - merges do not lead to duplicate text in the deltas
>>    - individual deltas can be copied without regeneration
>>    - memory use is typically ~ 2 + parent-count * text size for 
>>      application and generation
> 
> You didn't mention (potential) annotation reuse under multiple
> parent. Was that implied?

Well, we can't use it in the existing framework, but you're probably
right that we could hypothetically use it. And it may end up better than
the existing implementation.

> 
>> I think we have several choices we need to make.
>>
>> Firstly any compression parent choice that isn't solely along ancestry
>> is unlikely to be able to aid annotation, so we will have a strong
>> dependence on fast text extraction and insertion. So we need to either
>> decide that we require annotation acceleration or that we will optimise
>> for text extraction and allow annotation to depend on that.
>>

The problem here is that, last I checked, text extraction wasn't really
the bottleneck for 'bzr annotate'. Specifically, the existing annotation
code *does* extract every text. And for the MySQL tree, annotation was
about 6-10 times slower when we removed the existing delta re-use.

Text extraction is generally going to be a linear process (given N base
lines and M new lines, extracting a text will probably be about N + M).
However, computing a delta is generally an O(N^2), or at best
O(N log N), process.
I would guess there are things we could do, like cache the hash dict
info for each text, so we only compute it once rather than on each
evaluation. And you could even update that dict via a delta, rather than
computing it for the whole text.
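
To make that concrete, here is a minimal sketch of the hash-dict caching
idea. The class and function names are made up for illustration; none of
this is existing bzrlib API:

from collections import defaultdict


class LineIndexCache(object):
    """Cache of text-key -> {line: [positions]} used for delta generation."""

    def __init__(self):
        self._cache = {}

    def get_index(self, key, lines):
        # Build the line->positions dict once per base text and reuse it
        # for every delta generated against that base.
        index = self._cache.get(key)
        if index is None:
            index = defaultdict(list)
            for pos, line in enumerate(lines):
                index[line].append(pos)
            self._cache[key] = index
        return index


def candidate_matches(cache, base_key, base_lines, new_lines):
    """Yield (new_pos, base_pos) seeds for lines already in the base.

    This is only the match-finding half of delta generation; a real
    compressor would extend these seeds into copy ranges and emit insert
    instructions for the remaining lines.
    """
    index = cache.get_index(base_key, base_lines)
    for new_pos, line in enumerate(new_lines):
        positions = index.get(line)
        if positions:
            yield new_pos, positions[0]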

But in the end, speeding up text extraction is optimizing a linear
portion, while not optimizing the quadratic portion.

So I would tend to say that regardless what we do, if we want fast
annotation, I think we need annotation caching.

Also, note that Git is *not* particularly fast at annotation. The files
we were comparing in the git-vs-bzr annotation discussion had <100
revisions (and just happened to have a rename in the history).

The files we are talking about with mysql are significantly larger and
have 2k+ revisions.

>> Secondly, what compressor should we use? We could settle on one, or
>> perhaps allow several for different situations. Most importantly though
>> is to remember the driver for all of this is performance - we want to
>> end up with a very fast system.
> 
> Conceptually, I like the thought of:
> 
> * whatever is fastest for inventory fragments

I'm not sure about "fastest" versus "smallest". Certainly if all you
want is speed and you accept very weak compression, you could just use
very small pages. In the 'hash trie' thread, I worked out that the size
of the raw inventory is roughly:

50 bytes * (entries in root node
            + average number of changes
              * (entries in intermediate node * intermediate depth
                 + 6 * entries in leaf node))

The 50 bytes is the overhead plus a sha hash. The 6 is because a leaf
inventory entry is about 6 times larger than an internal node entry. The
MySQL tree has an "average number of changes" very close to 5.

"intermediate depth" can be thought of as a function:

  intermediate depth = log(tree size / (entries in leaf * entries in root))
                       / log(entries in intermediate)

In general the variables are quite intermixed and a bit hard to tease
apart, but you still have a whole lot of parameters that you can tweak.
We don't have a lot of numbers (yet) on how these parameters affect
performance, but in pretty much all cases we still gain the benefit that
comparing 2 trees will be O(log N) rather than O(N). The only thing that
really changes is the base of the log and the coefficients (which in my
experience can still be rather important, considering the old and new
indexes are both O(log N), but one is closer to log_2 N and the other to
log_100 N).

The big trick would be to balance

entries in root
== avg num changes * entries in intermediate * intermediate depth
== avg num changes * 6 * entries in leaf

Though I will say, the best I can come up with is still quite a bit
larger than if you include delta compression.
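
For what it's worth, plugging numbers into the estimate above looks
something like the following. The parameter values here are illustrative
guesses, not measurements of a real tree:

import math


def intermediate_depth(tree_size, leaf_entries, root_entries,
                       intermediate_entries):
    # intermediate depth = log(tree size / (entries in leaf * entries in root))
    #                      / log(entries in intermediate)
    depth = (math.log(tree_size / float(leaf_entries * root_entries))
             / math.log(intermediate_entries))
    return max(0, int(math.ceil(depth)))


def inventory_delta_size(tree_size, avg_changes, root_entries,
                         intermediate_entries, leaf_entries):
    depth = intermediate_depth(tree_size, leaf_entries, root_entries,
                               intermediate_entries)
    return 50 * (root_entries
                 + avg_changes * (intermediate_entries * depth
                                  + 6 * leaf_entries))


# ~13k paths, ~5 changes per commit, 256-way root fan-out, 16-entry
# intermediate and leaf nodes -- all knobs you could tweak.
print(inventory_delta_size(tree_size=13000, avg_changes=5, root_entries=256,
                           intermediate_entries=16, leaf_entries=16))
# prints 40800 (uncompressed bytes, before any zlib/delta compression)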


It would also be informative to find out how much overhead git has in
its inventory layers. To be honest, having one-big-file with simple
delta compression is hard to beat as long as you allow the delta chain
to become arbitrarily long.

Of course, in git's favor is that an inventory record is always ~30
bytes (name + binary sha hash), while in bzr an inventory record is
~300 bytes (pretty much all ASCII, and we have a hash, and a revision,
and a couple of file ids). Ours gets a bigger benefit from zlib/gzip
compression, so the final result isn't going to be the full 10:1, but at
a minimum they have a single handle as a sha hash, while we have a sha
hash *and* a file_id *and* a revision.
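
If you want a rough feel for how much of that gap zlib claws back, a
quick-and-dirty experiment is easy to run. The record layouts below are
only crude imitations I made up, not the real serializations:

import hashlib
import zlib


def git_style_record(i):
    # roughly: name + NUL + 20-byte binary sha, ~30 bytes
    name = ('file%04d.c' % i).encode('ascii')
    return name + b'\x00' + hashlib.sha1(name).digest()


def bzr_style_record(i):
    # roughly: name, file_id, revision_id and a 40-char hex sha, all ASCII
    name = 'file%04d.c' % i
    sha = hashlib.sha1(name.encode('ascii')).hexdigest()
    return ('%s\x00%s-20090101000000-deadbeef01234567\x00'
            'someone@example.com-20090101000000-cafebabe76543210\x00%s\n'
            % (name, name, sha)).encode('ascii')


git_blob = b''.join(git_style_record(i) for i in range(1000))
bzr_blob = b''.join(bzr_style_record(i) for i in range(1000))

for label, blob in [('git-ish', git_blob), ('bzr-ish', bzr_blob)]:
    print('%s raw=%d zlib=%d' % (label, len(blob), len(zlib.compress(blob))))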


> * multi-parent something for text files
> * group compress for binary files
> 
> If text vs binary detection isn't practical for performance reasons,
> the rule for applying group compress could be size based, e.g.
> any file > 1MB say.

I would actually say group compress for text files, and xdelta for
binary files; 'gc' certainly is (currently?) a line-based delta format.
Alternatively I would go with "lzma" for binary files, etc. Though I
guess you are considering 'annotate' when you are choosing multi-parent
versus group compress.

I think if we had a proper annotation cache, then the importance of the
text compression choice goes way down, as fundamentally you can cache
intermediate results to get away from the O(history_of_file) portion.
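
To sketch what I mean by an annotation cache (the names are invented,
and a real implementation would also need to combine the parents'
annotations at merge revisions, which this toy skips):

from difflib import SequenceMatcher


class AnnotationCache(object):
    """Cache (file_id, revision_id) -> [(origin_revision_id, line), ...]."""

    def __init__(self):
        self._annotations = {}

    def seed(self, key, lines):
        # The first revision of a file attributes every line to itself.
        self._annotations[key] = [(key[1], line) for line in lines]

    def annotate_child(self, parent_key, parent_lines, child_key, child_lines):
        # Only diff against the already-annotated parent, instead of
        # replaying the whole history of the file.
        parent_annotated = self._annotations[parent_key]
        matcher = SequenceMatcher(None, parent_lines, child_lines)
        annotated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                annotated.extend(parent_annotated[i1:i2])
            else:
                # New or replaced lines are attributed to the child revision.
                annotated.extend((child_key[1], line)
                                 for line in child_lines[j1:j2])
        self._annotations[child_key] = annotated
        return annotated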

> 
> It would be damn sweet if we could make it easy for developers to
> create new formats using a different heuristic, e.g. always use
> group compress for everything when the underlying application
> is version management under a photo editor, say! (Clearly not the
> main game but using bzrlib as a developer toolkit is something I
> expect to see more and more of.)
> 
>> I'm inclined at this point to write a benchmarker which can take a set
>> of texts, a compression order, and shove them in then extract them -
>> similar to the index benchmarker we wrote.
> 
> I can't see how we can make the right choices without this tool.
> Go for it, I say!
> 
> Ian C.
> 
> 

Of course, the big difficulty here is writing such a tool and then
selecting an appropriate corpus of texts to use, etc. I did write part
of that tool when I was evaluating whether we wanted to use "xdelta".
There are just a huge number of knobs you can tweak. For example, you
probably should weight 'recent' texts more heavily than old texts, and
how much do you weight 'annotate' performance versus other performance?
And what happens when you go as far as "cross-file" deltas? I don't
think it is true for all projects, but for some, cross-file deltas are
really a killer feature for getting the total size small. (I think the
kernel is one of those, because it has a lot of boilerplate in the
various arch directories.)
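
For the record, the skeleton of the tool I have in mind is pretty small;
something like the following, where the compressor interface and the
trivial zlib stand-in are placeholders I invented for knit/xdelta/
group-compress backends, and a real run would use a corpus pulled from
real branches:

import time
import zlib


class ZlibWholeTextCompressor(object):
    """Trivial stand-in: every text stored whole and zlib-compressed."""

    def __init__(self):
        self._store = {}

    def insert(self, key, text):
        self._store[key] = zlib.compress(text)

    def extract(self, key):
        return zlib.decompress(self._store[key])

    def stored_bytes(self):
        return sum(len(v) for v in self._store.values())


def benchmark(compressor, corpus):
    """corpus is a list of (key, bytes) in the chosen insertion order."""
    start = time.time()
    for key, text in corpus:
        compressor.insert(key, text)
    insert_time = time.time() - start

    start = time.time()
    for key, _ in corpus:
        compressor.extract(key)
    extract_time = time.time() - start

    return insert_time, extract_time, compressor.stored_bytes()


if __name__ == '__main__':
    # A synthetic corpus; a real one would come from real branches and
    # would weight recent texts more heavily than old ones.
    corpus = [(('file-1', 'rev-%d' % i),
               (b'boilerplate line\n' * 100)
               + ('revision %d\n' % i).encode('ascii'))
              for i in range(500)]
    print(benchmark(ZlibWholeTextCompressor(), corpus))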

John
=:->
