Initial results from testing xdelta3
John Arbash Meinel
john at arbash-meinel.com
Wed May 23 10:15:44 BST 2007
John Arbash Meinel wrote:
...
> 4) Compressing over history
>
> For this test, I extracted all 1427 revisions of builtins.py and
> compressed each text against an ancestor/child.
> The total size of the texts is about 130MB. As a data point, a
> tar.bz2 of these is 5MB, and a tar.gz is 30MB.
> The knit in my repository is 2.6MB, and if I rebuild it, it caches
> approximately every 43rd revision, for a total of 1.6MB.
>
> mode                                size (bytes)   time (ms)
> child_to_parent_full_chain               477,349        7464
> first_parent_to_child_full_chain      13,302,931       19448
> parent_to_first_child_full_chain       2,989,493       10294
>
>
> The first mode should be compressing each text against its left-hand
> (LH) parent. The second is compressing each text against the first
> child that claims it as its LH parent. And the last is compressing each
> text relative to the first child that claims it as any of its parents
> (these walks are sketched in code after this quote).
>
> a) We can get really good compression across the whole ancestry. (130MB
> => 477KB is about 272:1)
>
> b) For some reason, compressing parents against their children is
> actually worse than compressing children against their parents.
> I would have guessed that extracting the child is best when it is
> stored as a full text, but that arrangement doesn't seem to save as
> much space.
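
The three quoted modes are really just different choices of delta base
while walking the revision graph. As a rough illustration only (this is
not the actual test harness, and it uses a line-based difflib diff as a
cheap stand-in for xdelta3's binary deltas), the first two modes might
look something like:

    import difflib

    def delta_size(base, target):
        # Stand-in for an xdelta3 encode: the size of a line-based unified
        # diff.  Only a rough proxy for the binary deltas in the real tests.
        diff = difflib.unified_diff(base.splitlines(keepends=True),
                                    target.splitlines(keepends=True))
        return sum(len(line) for line in diff)

    def child_to_parent_total(texts, lh_parent):
        # Mode 1: delta each text against its left-hand (LH) parent;
        # texts with no parent are counted at full size.
        total = 0
        for rev_id, text in texts.items():
            parent = lh_parent.get(rev_id)
            base = texts[parent] if parent is not None else ""
            total += delta_size(base, text)
        return total

    def first_parent_to_child_total(texts, lh_parent):
        # Mode 2: delta each text against the first child that claims it
        # as its LH parent; texts with no such child count at full size.
        first_child = {}
        for rev_id in texts:            # texts assumed oldest-first
            parent = lh_parent.get(rev_id)
            if parent is not None and parent not in first_child:
                first_child[parent] = rev_id
        total = 0
        for rev_id, text in texts.items():
            child = first_child.get(rev_id)
            base = texts[child] if child is not None else ""
            total += delta_size(base, text)
        return total

The third mode differs from the second only in that any parent link, not
just the LH one, counts when picking the first child.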
I did a small update to the tests, and I'm convinced that my code is
somehow picking very sub-optimal bases. I thought what I was doing was
logical, but when I changed the code to try both directions (pseudocode):

  forward = encode(target, base)
  backward = encode(base, target)

the final size dropped from 477KB to 273KB (almost another 2x).
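
If "both directions" amounts to computing both deltas and keeping
whichever is smaller, the per-pair choice would look something like the
sketch below (the actual change may differ; encode here is any
xdelta3-style delta-size function, e.g. the delta_size stand-in above):

    def best_direction_size(encode, base, target):
        # Try the delta in both directions and keep whichever is smaller.
        # encode(source, target) returns the size of the delta needed to
        # rebuild `target` from `source`.
        forward = encode(base, target)      # delta to rebuild target from base
        backward = encode(target, base)     # delta to rebuild base from target
        return min(forward, backward)

A full-chain run would then just sum best_direction_size(delta_size,
parent_text, child_text) over every base/target pair.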
So I'll need to investigate what is going on a bit more. And in the
short term, it seems that compressing parents relative to children
really does give better results than children relative to parents.
John
=:->