to group compress or not to group compress

Robert Collins robertc at robertcollins.net
Wed Jan 7 01:15:56 GMT 2009


On Tue, 2009-01-06 at 16:04 -0600, John Arbash Meinel wrote:
> 
> The problem here is that last I checked text extraction wasn't really
> the bottleneck for 'bzr annotate'. Specifically, the existing
> annotation
> code *does* extract every text. And for mysql, it was about 6-10 times
> slower when we removed the existing delta re-use.

We can look at making that faster perhaps.

> Text extraction is generally going to be a linear process (given N
> base
> lines, and M new lines, extracting a text will probably be about N +
> M).
> However computing a delta is generally an O(N^2) or at least O(NlogN)
> process.
> I would guess there are things we could do, like cache the hash dict
> info for each text, so we only compute it once, rather than on each
> evaluation. And you could even update that dict via a delta, rather
> than
> computing it for the whole text.
> 
> But in the end, speeding up text extraction is optimizing a linear
> portion, while not optimizing the quadratic portion.
> 
> So I would tend to say that regardless what we do, if we want fast
> annotation, I think we need annotation caching.

Agreed.
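
To make the hash-dict caching idea above concrete, here is a rough
Python sketch (not bzrlib code; difflib just stands in for whatever
matcher we end up using). The expensive per-base line index is built
once and reused for every delta computed against that base:

from difflib import SequenceMatcher

_matcher_cache = {}  # base key -> SequenceMatcher with its line index built

def matching_opcodes(base_key, base_lines, new_lines):
    matcher = _matcher_cache.get(base_key)
    if matcher is None:
        matcher = SequenceMatcher(None)
        matcher.set_seq2(base_lines)   # builds the hash dict; done once per base
        _matcher_cache[base_key] = matcher
    matcher.set_seq1(new_lines)        # cheap per new text
    return matcher.get_opcodes()       # matching regions between new and base

Updating the cached dict via a delta, as suggested above, would be the
next step beyond this; the sketch only shows the compute-once part.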

> > * whatever is fastest for inventory fragments
> 
> I'm not sure about "fastest" versus "smallest". Certainly if you want
> fastest at very weak compression, you could just use very small pages.
> In the 'hash trie' thread, I worked out the size of the raw inventory
> is
> roughly:
> ....
> Though I will say, the best I can come up with is still quite a bit
> larger than if you include delta compression.

So 'fast' is a balance of data size, total IO needed, CPU processing, and
so on.

> It would also be informative to find out how much overhead git has in
> its inventory layers. To be honest, having one-big-file with simple
> delta compression is hard to beat as long as you allow the delta chain
> to become arbitrarily long.

For getting that full file back, yes. But for partial processing, no :).
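
What I mean by that, as a toy sketch (the names here are made up, not
any real bzr API): with one big file and an unbounded delta chain,
getting any single version means walking the whole chain back to the
last full text, even when the caller only wants a fragment of it.

def extract(version, store, apply_patch):
    # store maps version -> ('full', text) or ('delta', base_version, patch);
    # apply_patch is whatever patch application the format uses.
    record = store[version]
    if record[0] == 'full':
        return record[1]
    _, base_version, patch = record
    # Every delta back to the nearest full text must be applied first.
    return apply_patch(extract(base_version, store, apply_patch), patch)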

> > * multi-parent something for text files
> > * group compress for binary files
> > 
> > If text vs binary detection isn't practical for performance reasons,
> > the rule for applying group compress could be size based, e.g.
> > any file > 1MB say.
> 
> I would actually say group compress for text files, and xdelta for
> binary files. 'gc' certainly (currently?) is a line-based delta
> format.
> Alternatively I would go to "lzma" for binary files, etc. Though I
> guess
> you are considering 'annotate' when you are going for multi-parent
> versus group compress.

I need more figures on lzma; I suspect that if it works fast enough for
binary files we wouldn't need gc at all.
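
If we did split strategies, the selection rule quoted above could be as
simple as the following sketch (illustrative only; 'group_compressor' is
a stand-in, and lzma is just one candidate for the binary side):

import lzma

SIZE_THRESHOLD = 1024 * 1024   # the "> 1MB" rule suggested above

def looks_binary(data, probe=8192):
    # Cheap heuristic: NUL bytes near the start usually mean binary.
    return b'\x00' in data[:probe]

def compress_record(data, group_compressor):
    if looks_binary(data) or len(data) > SIZE_THRESHOLD:
        # Byte-oriented compressor for binaries, or anything too big to
        # sniff cheaply.
        return 'lzma', lzma.compress(data)
    # Line-based group compress for ordinary text files.
    return 'group-compress', group_compressor.compress(data)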

> I think if we had a proper annotation cache, then the importance of the
> text compression choice goes way down, as fundamentally you can cache
> intermediate results to get away from the O(history_of_file) portion.

Or cache even just the matching-region information, without keeping the
text data at all.
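
Roughly what I have in mind, as a toy sketch (the cache layout and names
are illustrative, not an existing bzr API): persist the matched regions
per (file, revision), then annotate by walking those regions alone,
never extracting or re-diffing any full texts.

# (file_id, rev) -> (line_count, [(parent_rev, parent_pos, new_pos, length), ...])
region_cache = {}

def annotate(file_id, rev, _memo=None):
    memo = {} if _memo is None else _memo
    if rev in memo:
        return memo[rev]
    line_count, regions = region_cache[(file_id, rev)]
    result = [rev] * line_count                    # default: line introduced in rev
    for parent_rev, ppos, npos, length in regions:
        parent_ann = annotate(file_id, parent_rev, memo)
        result[npos:npos + length] = parent_ann[ppos:ppos + length]
    memo[rev] = result
    return result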

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.