xdelta multi-parent hack ?

John Arbash Meinel john at arbash-meinel.com
Fri Jun 22 18:47:09 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I wonder, if doing xdelta on merged texts would get better compression -
> 
> e.g.
> cat parent1 parent2 combinedparents
> xdelta combinedparents childtext
> 
> In principle this would allow the output to select regions from parent2
> for use in childtext, but would obviously complicate the extraction
> routine somewhat - but perhaps acceptable.
> 
> -Rob

Initial reports are quite a success.

xd3-default
	lh_parent	418,027
	lh_child_child	296,641
	multi_parent	192,329
	multi_child	275,588

xd3-djw (Huffman compression of delta hunks)
	lh_parent	384,390
	lh_child_child	280,685
	multi_parent	179,144
	multi_child	260,161

xd3-NOCOMPRESS+zlib (per hunk)
	lh_parent	416,434
	lh_child_child	299,714
	multi_parent	191,537
	multi_child	278,067

xd3-NOCOMPRESS bzip2 of all texts
	lh_parent	251,010
	lh_child_child	199,552
	multi_parent	119,954
	multi_child	190,243

bdiff+zlib (per hunk)
	lh_parent	577,016
	lh_child_child	556,981
	multi_parent	476,234
	multi_child	541,441

bdiff + bzip2
	lh_parent	151,674
	lh_child_child	164,117
	multi_parent	136,148
	multi_child	164,745

mpknit + gzip hunks
	multi_parent	391,567

mpknit + bzip2 of all texts
	multi_parent	159,655
	

multi_parent does achieve the best single-hunk compression rate. Beating
lh_child_child by a fairly wide margin. (About 100k).

multi_child just uses all possible children as compression relatives.
Which probably isn't an optimal selection, though it does do slightly
better than lh_child_child

There are still a few things to consider, though.

a) It makes extraction considerably more tricky. Because instead of
knowing that you need 1 base to create the target, you may need
multiple, so you have to potentially track back into a branching
ancestry to figure out what base texts you need to recreate the current
tip. Data-wise it isn't terrible, because they all are going to exist in
the file before you get to the current point.

b) If we subscribe to the "new data is more commonly requested than old
data", then there is a distinct advantage to lh_child_child that doesn't
get mentioned otherwise. In that the newest revision is going to be a
fulltext, and "newish" ones are going to have short delta chains. While
for "lh_parent" and "multi_parent" you have to search around further.

c) lh_child_child gets most of the performance of multi_child without
the expense of having multiple bases that need to be extracted.

d) For "large" texts, this multi-parent hack will degrade, because it
can't fit everything into the window. But the text has to be very large,
and it generally should only degrade to that of lh_parent.

e) The current winner (by far) is xdelta NOCOMP + multi_parent + bzip2
of all texts. Managing a full 1002:1 compression ratio. (In comparison,
bzip2 + mpknit is 159,655 bytes).

f) Surprisingly even 'bdiff' is helped by this naive method of combining
the parent texts and just computing the output as a delta against a much
larger text. So we might consider doing this as a naive method for
regular 'patiencediff' or difflib. Which might mean we can do
multi-parent diffs without as much extra effort. (I know that xdelta can
handle moving lines around, as it is just a COPY instruction, I don't
know how 'bdiff' represents deltas internally, if they have to be in
linear order or not).

But yeah, multi-parent can indeed shrink xdeltas even further. It can
even shrink bdiff output (136,148 veruss 151,674).


The above data points also give us some ideas of what trade offs we
might want to implement. I'm also skipping the "linear" and "by size"
measurements, but I can generate them on request. And Martin mentioned
another one which was "All relative to a single fulltext (no chain)".

There is also the performance impact for creating these deltas. As near
as I can tell, it seems to be about 50%. So if it took 2s before, it now
takes 3s. (Or if it took 3s, it now takes 4.5s)

Nice thought,

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGfAsdJdeBCYSNAAMRAq4uAJ9Hm0ZG3mUhcOwCJ26WVnIblWDYIACeMolR
P7m6lyZtj8AifTb1b6gGlYc=
=toIa
-----END PGP SIGNATURE-----



More information about the bazaar mailing list