Initial results from testing xdelta3
John Arbash Meinel
john at arbash-meinel.com
Tue May 22 16:19:45 BST 2007
I took some time today to play around with the xdelta3 python wrappers.
I'll try to summarize what I found:
1) They are a bit difficult to build. The "make" process is very
Linux-specific, and I'm running on a Mac. However, I converted the SVN
upstream into bzr, and have some branches here:
http://bzr.arbash-meinel.com/branches/xdelta
trunk is the full upstream (which is xdelta1 and xdelta3).
And my work is on:
http://bzr.arbash-meinel.com/branches/xdelta/custom/
After downloading, you should be able to run "python setup.py build; sudo
python setup.py install", or use "python setup.py build_ext -i" to run
from source.
The official wrappers don't have much documentation, either, so I added a
bit as I worked towards understanding what was going on.
2) xdelta3 can indeed compress a text without another base, much like
gzip/bzip2 do, and that is enabled by default. It has some flags you can
set to change the compression algorithm. In this case, though, it isn't
better than gzip/bzip2, but it gets close. Compressing the current text
of builtins.py gives:
Original size:    145,219 bytes
bzip2:             29,546    52ms
gzip:              36,457    61ms
xdelta no comp:   124,725     6ms
xdelta comp 1:     53,489    25ms
xdelta default:    47,785    79ms
xdelta comp 9:     45,529   140ms
Comments:
Default compression looks like COMP_6 (which is what gzip uses, too).
With this file (allowing for some Python timing inconsistencies), bzip2
actually beats gzip in both size and speed.
Default compression (or COMP_1) seems a reasonable tradeoff between time
and space. It is close to gzip in both size and speed.
It is close enough that I wouldn't really recommend using no compression
and then compressing with gzip, or something like that.
NO_COMPRESSION actually does a bit of compression, because it still
finds common sub-strings. And it seems to do it really quickly.
There are more flags to try (like a variable bit-length encoding, etc).
A rough sketch of what these calls look like is below.
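This is a minimal sketch of the kind of call involved (not the actual
plugin code). It assumes the SWIG wrapper exposes
xd3_encode_memory(input, source, max_output_size, flags) roughly
mirroring the C API, and that the flag constants (XD3_NOCOMPRESS,
XD3_COMPLEVEL_1, XD3_COMPLEVEL_9) are exported on the module; the exact
names and signature may differ.

import time
import xdelta3

data = open('builtins.py', 'rb').read()

# Compress a single text with no source/base, using only xdelta3's
# built-in (secondary) compression, at a few different settings.
settings = [('no comp', xdelta3.XD3_NOCOMPRESS),
            ('comp 1',  xdelta3.XD3_COMPLEVEL_1),
            ('default', 0),
            ('comp 9',  xdelta3.XD3_COMPLEVEL_9)]
for name, flags in settings:
    start = time.time()
    # The second argument is the source text to delta against; None
    # means "no base", so only the self-compression is exercised.
    result, compressed = xdelta3.xd3_encode_memory(data, None,
                                                   len(data) + 1024, flags)
    elapsed = (time.time() - start) * 1000
    print('%-8s %8d bytes %6.0fms' % (name, len(compressed), elapsed))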
Decompression:
test_bzip2_decompress                         OK   16ms
test_gzip_decompress                          OK    6ms
test_simple_decompress_COMPLEVEL_1            OK    4ms
test_simple_decompress_COMPLEVEL_6            OK    4ms
test_simple_decompress_NO_COMPRESS            OK    2ms
test_simple_decompress_NO_FLAGS               OK    4ms
test_simple_decompress_no_source_COMPLEVEL_9  OK    4ms
Not very interesting. xdelta is a bit faster than gzip (especially
with NO_COMPRESS), but everything except bzip2 is so fast that it isn't
really an issue. (These numbers seem consistent, but they are within the
noise floor on this machine.)
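For completeness, decoding is the mirror image. Again, this is only a
sketch and assumes an xd3_decode_memory(input, source, max_output_size)
style wrapper (the real one may also take a flags argument):

import xdelta3

data = open('builtins.py', 'rb').read()
# Encode with no source, then decode and check the round trip.
result, compressed = xdelta3.xd3_encode_memory(data, None,
                                               len(data) + 1024, 0)
result, text = xdelta3.xd3_decode_memory(compressed, None, len(data) + 1024)
assert text == data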
3) Using xdelta to compress texts. I used the last commit of builtins.py
versus its parent. For patience_diff and difflib, I used our current
Knit serialization mode (start,end,count\ntext_lines; sketched below):
Mode                            delta (bytes)    time
test_compress_COMPLEVEL_1                 248     4ms
test_compress_COMPLEVEL_6                 222     6ms
test_compress_COMPLEVEL_9                 217     7ms
test_compress_NO_COMPRESS                 269     3ms
test_compress_NO_FLAGS                    222     6ms
test_compress_difflib_diff                618   103ms
test_compress_patience_diff               618   134ms
Basically, xdelta beats the pants off the python diff routines. It
completes approx 15-20x faster, and compresses to about 1/3rd the size.
I attribute that mostly to working on raw strings (rather than lines),
and to the output being a binary stream rather than ascii text.
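For reference, the knit-style line delta I am comparing against looks
roughly like this. It is only a simplified approximation using difflib
(the real Knit code and patience_diff differ in details), and the file
names are just placeholders:

import difflib

def line_delta(old_lines, new_lines):
    # Serialize a diff as 'start,end,count\n' headers, each followed by
    # the replacement lines (an approximation of the Knit delta format).
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            continue
        # Replace old_lines[i1:i2] with the (j2 - j1) lines that follow.
        out.append('%d,%d,%d\n' % (i1, i2, j2 - j1))
        out.extend(new_lines[j1:j2])
    return ''.join(out)

old_text = open('builtins.py.parent').read()   # placeholder file names
new_text = open('builtins.py').read()
delta = line_delta(old_text.splitlines(True), new_text.splitlines(True))
print('%d bytes' % len(delta))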
For extracting them to get the original text back, the numbers aren't
quite as amazing:
test_decompress_COMPLEVEL_1               2ms
test_decompress_COMPLEVEL_6               2ms
test_decompress_NO_COMPRESS               2ms
test_decompress_NO_FLAGS                  2ms
test_decompress_difflib_diff              5ms
test_decompress_no_source_COMPLEVEL_9     2ms
test_decompress_patience_diff             5ms
For the python implementations, that includes the time to .split() the
lines and .join() them again afterwards, which I would imagine is the
bulk of the time.
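Applying such a line delta is mostly list splicing plus that
.split()/.join() overhead. A sketch matching the format from the
previous example (it assumes every line ends in '\n'):

def apply_line_delta(old_lines, delta):
    # Rebuild the new text from old_lines plus 'start,end,count' hunks.
    delta_lines = delta.splitlines(True)
    new_lines = []
    pos = 0      # how much of old_lines has been consumed
    i = 0
    while i < len(delta_lines):
        start, end, count = map(int, delta_lines[i].rstrip('\n').split(','))
        new_lines.extend(old_lines[pos:start])              # unchanged run
        new_lines.extend(delta_lines[i + 1:i + 1 + count])  # replacement lines
        pos = end
        i += 1 + count
    new_lines.extend(old_lines[pos:])                       # trailing run
    return ''.join(new_lines)

# old_text and delta as in the previous sketch:
assert apply_line_delta(old_text.splitlines(True), delta) == new_text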
4) Compressing over history
For this test, I extracted all 1427 revisions of builtins.py and
compressed each text against an ancestor/child.
The total size of the texts is about 130MB. As a data point, a
tar.bz2 of these is 5MB, and a tar.gz is 30MB.
The knit in my repository is 2.6MB, and if I rebuild it, it caches a
full text at approximately every 43rd revision, for 1.6MB.
child_to_parent_full_chain               477,349    7464ms
first_parent_to_child_full_chain      13,302,931   19448ms
parent_to_first_child_full_chain       2,989,493   10294ms
The first mode should have each text compressed against its LH
(left-hand) parent. The second compresses each text against the first
child that claims it as a LH parent. And the last compresses each text
against the first child that claims it as any of its parents.
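The first mode basically boils down to a loop like this (a sketch, not
the plugin itself; the toy 'texts' and 'lh_parent' dicts stand in for
the real 1427-revision history, and the xd3_encode_memory signature is
assumed as before):

import xdelta3

# Toy stand-in for the real history: revision_id -> text, and
# revision_id -> left-hand parent (None for the first revision).
texts = {'rev-1': 'a\nb\n', 'rev-2': 'a\nb\nc\n', 'rev-3': 'a\nB\nc\n'}
lh_parent = {'rev-1': None, 'rev-2': 'rev-1', 'rev-3': 'rev-2'}

total = 0
for revision_id, text in texts.items():
    parent_id = lh_parent[revision_id]
    source = texts[parent_id] if parent_id is not None else None
    # Delta each text against its LH parent (no parent => standalone).
    result, delta = xdelta3.xd3_encode_memory(text, source,
                                              len(text) + 1024, 0)
    total += len(delta)
print('total compressed size: %d bytes' % total)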
a) We can get really good compression across the whole ancestry (130MB
=> 477KB is about 272:1).
b) For some reason, compressing parents against their children is
actually worse than compressing children against their parents.
I would guess that extracting the child is fastest if it is a full text,
but that layout doesn't seem to save as much space.
Now it is possible that my code is incorrect, and it is doing something
weird. I did check the first few bases, and it did seem to do the
right thing (the first text was delta'd against its only child, etc.).
I'm certainly planning on looking into it more thoroughly, but this is
the result I have so far.
John
=:->
PS> The plugin is available from:
http://bzr.arbash-meinel.com/plugins/xdelta_test