[MERGE] documentation from the london sprint

Robert Collins robertc at robertcollins.net
Tue Jun 5 05:06:35 BST 2007


On Mon, 2007-06-04 at 09:33 -0500, John Arbash Meinel wrote:


> I just want to make a quick comment here about my results with testing xdelta.
> I'm still finishing up a summary document (trying to write a summary takes a
> lot of effort to get the important info out).
> 
> But anyway, going by ancestry allows a much better delta compression. Ignoring
> all the other permutations I've tested, it is something like:
> 
> Compressing parents against their children:
>    280,676 bytes (1 full=40,956, 1288 delta=239,720)
> 
> Compressing children against their parents:
>    384,440 (4 full=51,394, 1285 delta=333,046)
> 
> Sorting by size of text, and compressing shorter texts against longer
>    540,134 (1 full=40,956, 1288 delta=499,178)
> 
> Just compressing based on the topological sorting that is present in the knit:
>    614,800 (1 full=37,897, 1288 delta=576,903)
> 
> 
> Notice that there is more than a 2:1 improvement in compression by going
> directly on ancestry, and about a 50% improvement going from parents to
> children, rather than the other way around.
> 
> Also, it is fun to note that this was 1289 texts with a total fulltext size of
> 120MB. (This is the builtins.py knit).
> 
> Looking at it, if we go by ancestry, we can actually do better than git's
> packed storage by around 2:1. (Though git has cross-file compression which we
> would need to think about).
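To make the size comparison above concrete, here is a toy Python sketch. It uses difflib as a stand-in for xdelta and invented revision texts, so only the relative sizes matter: one fulltext plus deltas along the ancestry chain beats storing every revision as a fulltext.

```python
import difflib

def delta(old, new):
    """Represent `new` as a unified diff against `old` (a stand-in
    for xdelta; only the relative sizes matter here)."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True)))

# A toy three-revision ancestry chain where each child appends a line.
base = "".join("line %d: some shared content\n" % i for i in range(50))
rev1 = base
rev2 = rev1 + "a line added in rev2\n"
rev3 = rev2 + "a line added in rev3\n"

# Storing every revision as a fulltext...
fulltexts = sum(len(t) for t in (rev1, rev2, rev3))
# ...versus one fulltext plus deltas along the ancestry chain.
chain = len(rev1) + len(delta(rev1, rev2)) + len(delta(rev2, rev3))

print(chain < fulltexts)  # the ancestry chain is far smaller
```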

I'm very interested in the layering here. Way back we had a 'put full
texts in, get full texts out' API, and we found performance issues,
specifically related to double-handling. My thought with xdelta (or
whatever) is that we can decouple the diffs done for annotations from
those done for storage; the reason to do that is to optimise the read
patterns, since we read texts much more often than we write them.
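As a sketch of that layering (illustrative only, not the real Bazaar API): callers hand in and get back fulltexts, while the store's internal representation, plain compression here, a delta against an ancestor in practice, stays private and can be changed without touching callers.

```python
import zlib

class TextStore:
    """Toy 'full texts in, full texts out' store: callers never see
    how texts are represented internally."""

    def __init__(self):
        self._blobs = {}

    def add(self, key, text):
        # The internal form is private -- here plain zlib; a real
        # store could instead delta against an ancestor text.
        self._blobs[key] = zlib.compress(text.encode("utf-8"))

    def get(self, key):
        # Callers always receive the reconstructed fulltext.
        return zlib.decompress(self._blobs[key]).decode("utf-8")

store = TextStore()
store.add("rev-1", "hello world\n")
print(store.get("rev-1"))  # round-trips the original text
```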

In the proposal you can do ancestry-related compression, but I think
we'd want to do that only loosely; I have a hunch (which needs testing)
that the locality-of-reference win from linear patch application
(allowing good hit rates on optimistic block reads with less index data
than we need now) is significantly greater than the win from reduced IO
and smaller disk size (though smaller disk size is important, and we
should not discount it)... as long as we can tune this at runtime we
can experiment, I guess.
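A small sketch of the locality argument (the record layout and names are invented): if a fulltext and the deltas built linearly on top of it are packed contiguously, reconstructing any revision in the run costs one sequential read of that span, rather than one seek per ancestry hop.

```python
import io

# Hypothetical linear layout: a fulltext followed by the deltas that
# apply on top of it, packed back to back in one file.
records = [b"FULLTEXT:base", b"DELTA:r1", b"DELTA:r2", b"DELTA:r3"]
index = []                    # (offset, length) per record
buf = io.BytesIO()
for rec in records:
    index.append((buf.tell(), len(rec)))
    buf.write(rec)
packed = buf.getvalue()

def read_chain(target):
    """One contiguous read covers the base and every delta up to
    `target`; no per-hop seek is needed."""
    start = index[0][0]
    end = index[target][0] + index[target][1]
    blob = packed[start:end]          # single sequential read
    return [blob[off - start:off - start + length]
            for off, length in index[:target + 1]]

print(read_chain(2))  # [b'FULLTEXT:base', b'DELTA:r1', b'DELTA:r2']
```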

Rob
 
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.