Initial results from testing xdelta3
John Arbash Meinel
john at arbash-meinel.com
Wed May 23 10:15:44 BST 2007
John Arbash Meinel wrote:
...
> 4) Compressing over history
>
> For this test, I extracted all 1427 revisions of builtins.py and
> compressed each text against an ancestor/child.
> The total size of the texts is about 130MB. As a data point, a
> tar.bz2 of these is 5MB, and a tar.gz is 30MB.
> The knit in my repository is 2.6MB, and if I rebuild it, it caches
> approximately every 43rd revision, for a total of 1.6MB.
>
> mode                                size (bytes)   time (ms)
> child_to_parent_full_chain               477,349        7464
> first_parent_to_child_full_chain      13,302,931       19448
> parent_to_first_child_full_chain       2,989,493       10294
>
>
> The first mode should be compressing each text against its left-hand
> (LH) parent. The second is compressing each text against the first
> child that claims it as its LH parent. And the last is compressing each
> text relative to the first child that claims it as any of its parents
> (these walks are sketched in code after this quote).
>
> a) We can get really good compression across the whole ancestry. (130MB
> => 477KB is about 272:1)
>
> b) For some reason, compressing parents against their children is
> actually worse than compressing children against their parents.
> I would have guessed that extracting the child is best when it is
> stored as a full text, but that arrangement doesn't seem to save as
> much space.
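
The three quoted modes are really just different choices of delta base
while walking the revision graph. As a rough illustration only (this is
not the actual test harness, and it uses a line-based difflib diff as a
cheap stand-in for xdelta3's binary deltas), the first two modes might
look something like:

    import difflib

    def delta_size(base, target):
        # Stand-in for an xdelta3 encode: the size of a line-based unified
        # diff.  Only a rough proxy for the binary deltas in the real tests.
        diff = difflib.unified_diff(base.splitlines(keepends=True),
                                    target.splitlines(keepends=True))
        return sum(len(line) for line in diff)

    def child_to_parent_total(texts, lh_parent):
        # Mode 1: delta each text against its left-hand (LH) parent;
        # texts with no parent are counted at full size.
        total = 0
        for rev_id, text in texts.items():
            parent = lh_parent.get(rev_id)
            base = texts[parent] if parent is not None else ""
            total += delta_size(base, text)
        return total

    def first_parent_to_child_total(texts, lh_parent):
        # Mode 2: delta each text against the first child that claims it
        # as its LH parent; texts with no such child count at full size.
        first_child = {}
        for rev_id in texts:            # texts assumed oldest-first
            parent = lh_parent.get(rev_id)
            if parent is not None and parent not in first_child:
                first_child[parent] = rev_id
        total = 0
        for rev_id, text in texts.items():
            child = first_child.get(rev_id)
            base = texts[child] if child is not None else ""
            total += delta_size(base, text)
        return total

The third mode differs from the second only in that any parent link, not
just the LH one, counts when picking the first child.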
I did a small update to the tests, and I'm convinced that my code is
somehow picking very sub-optimal bases. I thought what I was doing was
logical, but when I changed the code to try both directions (pseudocode):

  forward = encode(target, base)
  backward = encode(base, target)

the final size dropped from 477KB to 273KB (almost another 2x).
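
If "both directions" amounts to computing both deltas and keeping
whichever is smaller, the per-pair choice would look something like the
sketch below (the actual change may differ; encode here is any
xdelta3-style delta-size function, e.g. the delta_size stand-in above):

    def best_direction_size(encode, base, target):
        # Try the delta in both directions and keep whichever is smaller.
        # encode(source, target) returns the size of the delta needed to
        # rebuild `target` from `source`.
        forward = encode(base, target)      # delta to rebuild target from base
        backward = encode(target, base)     # delta to rebuild base from target
        return min(forward, backward)

A full-chain run would then just sum best_direction_size(delta_size,
parent_text, child_text) over every base/target pair.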
So I'll need to investigate what is going on a bit more. And in the
short term, it seems that compressing parents relative to children
really does give better results than children relative to parents.
John
=:->