[MERGE] documentation from the london sprint
John Arbash Meinel
john at arbash-meinel.com
Mon Jun 4 15:33:29 BST 2007
Aaron Bentley wrote:
> Robert Collins wrote:
>> This is documentation from the london sprint on add, annotate, gc and
>> revert as well as some extra push-pull notes.
>
> Oh, did you think this stuff needs discussion? I considered it trivial.
> Maybe there are ways to improve the document, but as long as it's a true
> rendition of what we discussed in London, I'd consider it mergeable.
>
> One thing is that these texts seem a lot wider than our 79 char
> standard. Another is that you spelled siblings "siblins" in one place.
>
> Also, there are some RST formatting errors:
> $ rst2html performance-roadmap.txt > perf.html
> incremental-push-pull.txt:251: (ERROR/3) Unexpected indentation.
> incremental-push-pull.txt:262: (ERROR/3) Unexpected indentation.
> performance-roadmap.txt:580: (WARNING/2) Inline strong start-string
> without end-string.
> add.txt:32: (ERROR/3) Unexpected indentation.
> performance-roadmap.txt:628: (WARNING/2) Block quote ends without a
> blank line; unexpected unindent.
>
> In particular, your ordered list formatting isn't right. For ReST, it
> should be:
>
> #. First item
> #. Next item
>
> etc.
>
>> +If the annotation works by asking the storage layer for successive deltas
>> +within the history of the thing being annotated we believe we can make it scale
>> +broadly proportional to the depth of the tree of revisions of the annotated
>> +object.
>
>> + * Perhaps multiparent deltas would allow us to not store the cached
>> + annotations in each delta without losing performance or accuracy.
>
> I think multiparent does give you that. I think asking for
> non-multiparent deltas doesn't.
>
> The algorithm I have for reconstructing texts in multiparent is always
> aware what revision it's pulling lines from. But that means we continue
> to pun annotation storage with fulltext storage, which could be a loss.
>
> One gain though, is that you can potentially use deltas in the sequence
> matching used by annotate-merge. That would scale with the number of
> revisions and the size of the deltas, which is better than scaling with
> the square of the size of the text.
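(Just to make the delta-walking annotation idea concrete, here is a rough
sketch of how it could work. This is only an illustration, not the actual
knit or multiparent code: the delta format below -- a list of
(start, end, new_lines) hunks meaning "replace lines[start:end] with
new_lines" -- is an assumption made for the example.)

def annotate_via_deltas(base_revision, base_lines, delta_chain):
    """Annotate a text by replaying deltas from the oldest revision forward.

    base_lines  -- full text of the oldest revision, as a list of lines.
    delta_chain -- iterable of (revision_id, delta) pairs in ancestry order,
                   where each delta is a list of (start, end, new_lines)
                   hunks sorted by start, relative to the pre-delta text.
    Returns a list of (revision_id, line) pairs for the final text.
    """
    # Every line starts out attributed to the base revision.
    annotated = [(base_revision, line) for line in base_lines]
    for revision_id, delta in delta_chain:
        # Apply hunks from the bottom up so earlier offsets stay valid.
        for start, end, new_lines in reversed(delta):
            annotated[start:end] = [(revision_id, line) for line in new_lines]
    return annotated

The work here is proportional to the number of deltas walked and their
size, which is the "scales with the depth of the revision tree" property
the quoted text describes.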
Separately, I just want to make a quick comment here about my results from
testing xdelta. I'm still finishing up a summary document (distilling the
important information into a summary takes a surprising amount of effort).
But anyway, deltaing along the ancestry gives much better compression.
Ignoring all the other permutations I've tested, the results look roughly
like this:
Compressing parents against their children:
    280,676 bytes (1 fulltext = 40,956, 1288 deltas = 239,720)
Compressing children against their parents:
    384,440 bytes (4 fulltexts = 51,394, 1285 deltas = 333,046)
Sorting by size of text and compressing shorter texts against longer:
    540,134 bytes (1 fulltext = 40,956, 1288 deltas = 499,178)
Just compressing in the topological order already present in the knit:
    614,800 bytes (1 fulltext = 37,897, 1288 deltas = 576,903)
Notice that deltaing directly along the ancestry gives better than a 2:1
improvement in compression over the knit's topological ordering, and that
compressing parents against their children, rather than the other way
around, shaves off roughly another 27%.
Also, it is fun to note that this was 1289 texts with a total fulltext size of
120MB. (This is the builtins.py knit).
Looking at it, if we go by ancestry we can actually do better than git's
packed storage by around 2:1. (Though git has cross-file compression, which
we would need to think about.)
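For anyone who wants to play with the ordering effect themselves, a
boiled-down sketch of this sort of comparison might look like the
following. This is not my actual test harness: it uses zlib-compressed
unified diffs as a crude stand-in for xdelta, and the helper names are
made up, but the relative effect of the choice of delta basis should
still show up.

import difflib
import zlib

def fulltext_size(lines):
    """Size of a compressed fulltext, in bytes."""
    return len(zlib.compress(''.join(lines).encode('utf-8')))

def delta_size(basis_lines, target_lines):
    """Approximate size of a delta of target against basis, in bytes."""
    diff = ''.join(difflib.unified_diff(basis_lines, target_lines))
    return len(zlib.compress(diff.encode('utf-8')))

def total_storage(texts, basis_map):
    """Total bytes to store every text given a choice of delta basis.

    texts     -- dict of revision_id -> list of lines (newline-terminated).
    basis_map -- dict of revision_id -> revision_id to delta against,
                 or None to store that revision as a fulltext.
    """
    total = 0
    for revision_id, lines in texts.items():
        basis = basis_map.get(revision_id)
        if basis is None:
            total += fulltext_size(lines)
        else:
            total += delta_size(texts[basis], lines)
    return total

"Parents against children" then corresponds to a basis_map where each
parent points at one of its children (with the tip stored as a fulltext),
and "children against parents" to a map where each child points at its
first parent (with the root stored as a fulltext).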
>
>> + 1. get 'new enough' hash data for the siblins of the scope: it can be out of date as long as its not older than the last move or rename out of that siblings scope.
>
> I don't really understand this bit about "new enough" sibling data.
>
> Aaron
John
=:->