Initial results from testing xdelta3
John Arbash Meinel
john at arbash-meinel.com
Tue May 22 16:19:45 BST 2007
I took some time today to play around with the xdelta3 python wrappers.
I'll try to summarize what I found:
1) They are a bit difficult to build. The "make" process is very
Linux-specific, and I'm running on a Mac. However, I converted the SVN
upstream into bzr, and have some branches here:
http://bzr.arbash-meinel.com/branches/xdelta
trunk is the full upstream (which is xdelta1 and xdelta3).
And my work is on:
http://bzr.arbash-meinel.com/branches/xdelta/custom/
After downloading, you should be able to run "python setup.py build; sudo
python setup.py install", or use "python setup.py build_ext -i" to run
from source.
The official wrappers don't have much documentation, either, so I added a
bit as I worked towards understanding what was going on.
2) xdelta3 can indeed compress a text without another base, much like
gzip/bzip2 do, and that is enabled by default. It has some flags you can
set to change the compression algorithm. In this case, though, it isn't
better than gzip/bzip2, but it gets close. Compressing the current text
of builtins.py gives:
Original size:    145,219 bytes
bzip2:             29,546    52ms
gzip:              36,457    61ms
xdelta no comp:   124,725     6ms
xdelta comp 1:     53,489    25ms
xdelta default:    47,785    79ms
xdelta comp 9:     45,529   140ms
Comments:
Default compression looks like COMP_6 (which is what gzip uses, too).
With this file (allowing for some Python timing inconsistencies), bzip2
actually beats gzip in both size and speed.
Default compression (or COMP_1) seems a reasonable tradeoff between time
and space. It is close to gzip in both size and speed.
It is close enough that I wouldn't really recommend using no compression
and then compressing with gzip, or something like that.
NO_COMPRESSION actually does a bit of compression, because it still
finds common sub-strings. And it seems to do it really quickly.
There are more flags to try (like a variable bit-length encoding, etc).
A rough sketch of what these calls look like is below.
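This is a minimal sketch of the kind of call involved (not the actual
plugin code). It assumes the SWIG wrapper exposes
xd3_encode_memory(input, source, max_output_size, flags) roughly
mirroring the C API, and that the flag constants (XD3_NOCOMPRESS,
XD3_COMPLEVEL_1, XD3_COMPLEVEL_9) are exported on the module; the exact
names and signature may differ.

import time
import xdelta3

data = open('builtins.py', 'rb').read()

# Compress a single text with no source/base, using only xdelta3's
# built-in (secondary) compression, at a few different settings.
settings = [('no comp', xdelta3.XD3_NOCOMPRESS),
            ('comp 1',  xdelta3.XD3_COMPLEVEL_1),
            ('default', 0),
            ('comp 9',  xdelta3.XD3_COMPLEVEL_9)]
for name, flags in settings:
    start = time.time()
    # The second argument is the source text to delta against; None
    # means "no base", so only the self-compression is exercised.
    result, compressed = xdelta3.xd3_encode_memory(data, None,
                                                   len(data) + 1024, flags)
    elapsed = (time.time() - start) * 1000
    print('%-8s %8d bytes %6.0fms' % (name, len(compressed), elapsed))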
Decompression:
test_bzip2_decompress                         OK   16ms
test_gzip_decompress                          OK    6ms
test_simple_decompress_COMPLEVEL_1            OK    4ms
test_simple_decompress_COMPLEVEL_6            OK    4ms
test_simple_decompress_NO_COMPRESS            OK    2ms
test_simple_decompress_NO_FLAGS               OK    4ms
test_simple_decompress_no_source_COMPLEVEL_9  OK    4ms
Not very interesting. xdelta is a bit faster than gzip (especially
with NO_COMPRESS), but everything except bzip2 is so fast that it isn't
really an issue. (These numbers seem consistent, but they are within the
noise floor on this machine.)
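For completeness, decoding is the mirror image. Again, this is only a
sketch and assumes an xd3_decode_memory(input, source, max_output_size)
style wrapper (the real one may also take a flags argument):

import xdelta3

data = open('builtins.py', 'rb').read()
# Encode with no source, then decode and check the round trip.
result, compressed = xdelta3.xd3_encode_memory(data, None,
                                               len(data) + 1024, 0)
result, text = xdelta3.xd3_decode_memory(compressed, None, len(data) + 1024)
assert text == data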
3) Using xdelta to compress texts. I used the last commit of builtins.py
versus its parent. For patience_diff and difflib, I used our current
Knit serialization mode (start,end,count\ntext_lines; sketched below):
Mode                            delta (bytes)    time
test_compress_COMPLEVEL_1                 248     4ms
test_compress_COMPLEVEL_6                 222     6ms
test_compress_COMPLEVEL_9                 217     7ms
test_compress_NO_COMPRESS                 269     3ms
test_compress_NO_FLAGS                    222     6ms
test_compress_difflib_diff                618   103ms
test_compress_patience_diff               618   134ms
Basically, xdelta beats the pants off the python diff routines. It
completes approx 15-20x faster, and compresses to about 1/3rd the size.
I attribute that mostly to working on raw strings (rather than lines),
and to the output being a binary stream rather than ascii text.
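For reference, the knit-style line delta I am comparing against looks
roughly like this. It is only a simplified approximation using difflib
(the real Knit code and patience_diff differ in details), and the file
names are just placeholders:

import difflib

def line_delta(old_lines, new_lines):
    # Serialize a diff as 'start,end,count\n' headers, each followed by
    # the replacement lines (an approximation of the Knit delta format).
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            continue
        # Replace old_lines[i1:i2] with the (j2 - j1) lines that follow.
        out.append('%d,%d,%d\n' % (i1, i2, j2 - j1))
        out.extend(new_lines[j1:j2])
    return ''.join(out)

old_text = open('builtins.py.parent').read()   # placeholder file names
new_text = open('builtins.py').read()
delta = line_delta(old_text.splitlines(True), new_text.splitlines(True))
print('%d bytes' % len(delta))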
For extracting them to get the original text back, the numbers aren't
quite as amazing:
test_decompress_COMPLEVEL_1               2ms
test_decompress_COMPLEVEL_6               2ms
test_decompress_NO_COMPRESS               2ms
test_decompress_NO_FLAGS                  2ms
test_decompress_difflib_diff              5ms
test_decompress_no_source_COMPLEVEL_9     2ms
test_decompress_patience_diff             5ms
For the python implementations, that includes the time to .split() the
lines and .join() them again afterwards, which I would imagine is the
bulk of the time.
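Applying such a line delta is mostly list splicing plus that
.split()/.join() overhead. A sketch matching the format from the
previous example (it assumes every line ends in '\n'):

def apply_line_delta(old_lines, delta):
    # Rebuild the new text from old_lines plus 'start,end,count' hunks.
    delta_lines = delta.splitlines(True)
    new_lines = []
    pos = 0      # how much of old_lines has been consumed
    i = 0
    while i < len(delta_lines):
        start, end, count = map(int, delta_lines[i].rstrip('\n').split(','))
        new_lines.extend(old_lines[pos:start])              # unchanged run
        new_lines.extend(delta_lines[i + 1:i + 1 + count])  # replacement lines
        pos = end
        i += 1 + count
    new_lines.extend(old_lines[pos:])                       # trailing run
    return ''.join(new_lines)

# old_text and delta as in the previous sketch:
assert apply_line_delta(old_text.splitlines(True), delta) == new_text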
4) Compressing over history
For this test, I extracted all 1427 revisions of builtins.py and
compressed each text against an ancestor/child.
The total size of the texts is about 130MB. As a data point, a
tar.bz2 of these is 5MB, and a tar.gz is 30MB.
The knit in my repository is 2.6MB, and if I rebuild it, it caches a
full text at approximately every 43rd revision, for 1.6MB.
child_to_parent_full_chain               477,349    7464ms
first_parent_to_child_full_chain      13,302,931   19448ms
parent_to_first_child_full_chain       2,989,493   10294ms
The first mode should have each text compressed against its LH
(left-hand) parent. The second compresses each text against the first
child that claims it as a LH parent. And the last compresses each text
against the first child that claims it as any of its parents.
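The first mode basically boils down to a loop like this (a sketch, not
the plugin itself; the toy 'texts' and 'lh_parent' dicts stand in for
the real 1427-revision history, and the xd3_encode_memory signature is
assumed as before):

import xdelta3

# Toy stand-in for the real history: revision_id -> text, and
# revision_id -> left-hand parent (None for the first revision).
texts = {'rev-1': 'a\nb\n', 'rev-2': 'a\nb\nc\n', 'rev-3': 'a\nB\nc\n'}
lh_parent = {'rev-1': None, 'rev-2': 'rev-1', 'rev-3': 'rev-2'}

total = 0
for revision_id, text in texts.items():
    parent_id = lh_parent[revision_id]
    source = texts[parent_id] if parent_id is not None else None
    # Delta each text against its LH parent (no parent => standalone).
    result, delta = xdelta3.xd3_encode_memory(text, source,
                                              len(text) + 1024, 0)
    total += len(delta)
print('total compressed size: %d bytes' % total)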
a) We can get really good compression across the whole ancestry (130MB
=> 477KB is about 272:1).
b) For some reason, compressing parents against their children is
actually worse than compressing children against their parents.
I would guess that extracting the child is fastest if it is a full text,
but that layout doesn't seem to save as much space.
Now it is possible that my code is incorrect, and it is doing something
weird. I did check the first few bases, and it did seem to do the
right thing (the first text was delta'd against its only child, etc.).
I'm certainly planning on looking into it more thoroughly, but this is
the result I have so far.
John
=:->
PS> The plugin is available from:
http://bzr.arbash-meinel.com/plugins/xdelta_test