xdelta multi-parent hack ?
John Arbash Meinel
john at arbash-meinel.com
Fri Jun 22 18:47:09 BST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Robert Collins wrote:
> I wonder, if doing xdelta on merged texts would get better compression -
>
> e.g.
> cat parent1 parent2 combinedparents
> xdelta combinedparents childtext
>
> In principle this would allow the output to select regions from parent2
> for use in childtext, but would obviously complicate the extraction
> routine somewhat - but perhaps acceptable.
>
> -Rob
Initial reports are quite a success.
xd3-default
lh_parent 418,027
lh_child_child 296,641
multi_parent 192,329
multi_child 275,588
xd3-djw (Huffman compression of delta hunks)
lh_parent 384,390
lh_child_child 280,685
multi_parent 179,144
multi_child 260,161
xd3-NOCOMPRESS+zlib (per hunk)
lh_parent 416,434
lh_child_child 299,714
multi_parent 191,537
multi_child 278,067
xd3-NOCOMPRESS bzip2 of all texts
lh_parent 251,010
lh_child_child 199,552
multi_parent 119,954
multi_child 190,243
bdiff+zlib (per hunk)
lh_parent 577,016
lh_child_child 556,981
multi_parent 476,234
multi_child 541,441
bdiff + bzip2
lh_parent 151,674
lh_child_child 164,117
multi_parent 136,148
multi_child 164,745
mpknit + gzip hunks
multi_parent 391,567
mpknit + bzip2 of all texts
multi_parent 159,655
multi_parent does achieve the best single-hunk compression rate. Beating
lh_child_child by a fairly wide margin. (About 100k).
multi_child just uses all possible children as compression relatives.
Which probably isn't an optimal selection, though it does do slightly
better than lh_child_child
There are still a few things to consider, though.
a) It makes extraction considerably more tricky. Because instead of
knowing that you need 1 base to create the target, you may need
multiple, so you have to potentially track back into a branching
ancestry to figure out what base texts you need to recreate the current
tip. Data-wise it isn't terrible, because they all are going to exist in
the file before you get to the current point.
b) If we subscribe to the "new data is more commonly requested than old
data", then there is a distinct advantage to lh_child_child that doesn't
get mentioned otherwise. In that the newest revision is going to be a
fulltext, and "newish" ones are going to have short delta chains. While
for "lh_parent" and "multi_parent" you have to search around further.
c) lh_child_child gets most of the performance of multi_child without
the expense of having multiple bases that need to be extracted.
d) For "large" texts, this multi-parent hack will degrade, because it
can't fit everything into the window. But the text has to be very large,
and it generally should only degrade to that of lh_parent.
e) The current winner (by far) is xdelta NOCOMP + multi_parent + bzip2
of all texts. Managing a full 1002:1 compression ratio. (In comparison,
bzip2 + mpknit is 159,655 bytes).
f) Surprisingly even 'bdiff' is helped by this naive method of combining
the parent texts and just computing the output as a delta against a much
larger text. So we might consider doing this as a naive method for
regular 'patiencediff' or difflib. Which might mean we can do
multi-parent diffs without as much extra effort. (I know that xdelta can
handle moving lines around, as it is just a COPY instruction, I don't
know how 'bdiff' represents deltas internally, if they have to be in
linear order or not).
But yeah, multi-parent can indeed shrink xdeltas even further. It can
even shrink bdiff output (136,148 veruss 151,674).
The above data points also give us some ideas of what trade offs we
might want to implement. I'm also skipping the "linear" and "by size"
measurements, but I can generate them on request. And Martin mentioned
another one which was "All relative to a single fulltext (no chain)".
There is also the performance impact for creating these deltas. As near
as I can tell, it seems to be about 50%. So if it took 2s before, it now
takes 3s. (Or if it took 3s, it now takes 4.5s)
Nice thought,
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGfAsdJdeBCYSNAAMRAq4uAJ9Hm0ZG3mUhcOwCJ26WVnIblWDYIACeMolR
P7m6lyZtj8AifTb1b6gGlYc=
=toIa
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list