[RFC] compression-grouping records in pack files.

John Arbash Meinel john at arbash-meinel.com
Thu Jun 21 16:12:42 BST 2007



Aaron Bentley wrote:
> Hi all,
> 
> It occurs to me that, for bundles at least, there would be advantages to
> compressing a series of records as a group.
> 


Agreed, and it was one of the big things we discussed in London.
Specifically one of the options was to have:

fulltext 1, delta 2, delta 3, delta 4

Each delta would be based explicitly on the one before it (not on its
ancestry), and the index pointers for all of these records would point
at the 'fulltext 1' record, since that is where you need to start
reading to extract, say, 'delta 3'.

You could then bzip2/gzip/etc. compress that whole series. That also
has the advantage that as you read the compressed stream, you only have
to decompress up to the record you care about (delta 3); you don't have
to decompress delta 4. (A "naive" implementation might decompress
everything anyway, and it probably wouldn't lose a whole lot unless the
chain was really long, in the hundreds or thousands of records.)
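
To make that reading pattern concrete, here is a minimal sketch of
extracting one record from such a group. It assumes a made-up framing
(a single zlib stream of length-prefixed records), not the real
pack/container format, and it takes the delta-application function as
a parameter since that part is whatever delta scheme we settle on:

import struct
import zlib

def extract_record(group_bytes, target_index, apply_delta):
    """Pull record `target_index` out of a group compressed as one zlib
    stream of length-prefixed records [fulltext, delta 1, delta 2, ...],
    each delta based on the record just before it.  We feed the
    decompressor small chunks and stop as soon as the target record is
    complete, so records after the target are never decompressed.
    (Hypothetical framing, not bzr's actual pack format.)"""
    decomp = zlib.decompressobj()
    plain = b''      # decompressed bytes so far
    pos = 0          # read position inside `plain`
    consumed = 0     # how much of the compressed group has been fed in
    text = None
    for i in range(target_index + 1):
        # Decompress just enough to complete the next record.
        while True:
            if len(plain) - pos >= 4:
                (length,) = struct.unpack('>I', plain[pos:pos + 4])
                if len(plain) - pos >= 4 + length:
                    break
            if consumed >= len(group_bytes):
                raise ValueError('group ended before record %d' % i)
            chunk = group_bytes[consumed:consumed + 4096]
            plain += decomp.decompress(chunk)
            consumed += 4096
        record = plain[pos + 4:pos + 4 + length]
        pos += 4 + length
        # Record 0 is the fulltext; everything after is a delta on its
        # predecessor.
        text = record if i == 0 else apply_delta(text, record)
    return text

Building such a group is just the reverse: compress the concatenation
of struct.pack('>I', len(r)) + r for each record r in order.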

There are some nice I/O patterns, etc., to this layout. But I'm not
sure how it fits into the "Bundle Format" that has been discussed
(whether you would actually need to create a container inside a
container, or whether it would be addressed explicitly by the overall
format).

The other thing I've discovered is that compression over reverse
ancestry is *WAY* better than any of the other base selection methods
(at least for our builtins.py file). That doesn't play as nicely with
basing each delta on the previous record in the file, since you will
have some amount of branching.

Some comparisons (total compressed size in bytes, with the fulltext
and delta portions broken out):

        lh_child_child   280,676 (1 full=40,956, 1288 delta=239,720)
        lh_parent        384,440 (4 full=51,394, 1285 delta=333,046)
        by_size          540,134 (1 full=40,956, 1288 delta=499,178)
        linear           614,800 (1 full=37,897, 1288 delta=576,903)

Now, this has each delta individually compressed, rather than compressed
along the whole range (because that is how xdelta3 works).

builtins.py has 4 'root' nodes and 1285 deltas (and in my repository
it actually has about 100 'tip' nodes).

Anyway, if you compress each node against a child that claims it as a
left-hand parent (or against any child, if you can't find the first
kind), the whole history takes up 280kB.
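
In code terms, the 'lh_child_child' base selection is roughly the
following sketch. The names and the plain-dict revision graph are my
own, not the xdelta_test plugin's, and this version stores a fulltext
for every childless tip, so it does not by itself reproduce the single
fulltext in the numbers above:

def choose_bases_lh_child_child(parent_map):
    """For each revision, pick a delta base: prefer a child that lists
    this revision as its leftmost (left-hand) parent, fall back to any
    child, and mark revisions with no children (tips) as fulltexts."""
    children = {}      # revision -> all of its children
    lh_children = {}   # revision -> children naming it as leftmost parent
    for rev, parents in parent_map.items():
        for idx, parent in enumerate(parents):
            children.setdefault(parent, []).append(rev)
            if idx == 0:
                lh_children.setdefault(parent, []).append(rev)
    bases = {}
    for rev in parent_map:
        if rev in lh_children:
            bases[rev] = lh_children[rev][0]
        elif rev in children:
            bases[rev] = children[rev][0]
        else:
            bases[rev] = None   # a tip: stored as a fulltext here
    return bases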

If you go to each node and compress it relative to its LH parent, it
is 384kB (about 37% bigger).

If you just sort all of them by size, making the largest node first
and every smaller text come after it, it is 540kB (+92% above
'optimal').

If you just use the same topological sorting that is already present in
my builtins.py you get 614kB (+120% above the smallest).

Some other interesting facts: the 4 compressed fulltexts for lh_parent
fit in approximately the same space as 1 compressed fulltext for the
other algorithms. That is because those fulltexts come from when
builtins.py was small (revno <100), rather than from when builtins.py
was big (revno >2000).

So for a transmission bundle, where you want the best compression,
compressing by ancestry seems a definite win.

For a "repository" bundle, where you want optimal extraction speed, we
may play with the tradeoffs a bit, because 'by_size' does OK and it
means that each revision is compressed against the one that comes
before it in the file (so you can read forward through the records and
hit what you want).


There is another possibility, though. Robert's argument was that storing
deltas in a row means that you don't need to "precompute" which ones you
need to extract. So that information doesn't need to be saved (in the
index?).

However, when you use that approach you *have* to extract and apply
every delta until you get to where you want to be. With the
ancestry-based approach, only *some* of them are needed. For example,
if you have 100 revisions of builtins.py in a row, you use
'lh_child_child', and you want to extract the last entry, you only need
to apply 16 deltas rather than 100.

Now, if you don't record the ancestry information in the index, you
still have to read all 100 records... But that is actually cheaper than
*applying* all 100 records (how much cheaper depends significantly on
your delta algorithm).


Just to say: we can use lh_child_child and still not store the "this
delta is based on that record" information in the index. We just store
it in the delta hunks, have the reader read the same data it would read
for a linear selection, and throw some of the records away (a sketch of
that reader follows below).

Which still wins overall because:
a) Delta compression is better, so it has to read less total data (~2:1).
b) When applying deltas, it has to apply far fewer (16 versus 100). You
could think of this much like "skip deltas", in that you jump across
the ancestry and apply fewer than all of the nodes. In the *worst* case
it degrades to a straight chain, which is still no worse than the
'linear' layout.


We might consider a skip-delta style base selection for bundle deltas,
since it would give us better "worst case" behaviour. But if we have to
read the whole series anyway because of index considerations, it
doesn't seem a net win.
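
For reference, the classic skip-delta base selection (the scheme
Subversion uses, sketched here over a simple linear numbering rather
than bzr's revision graph) picks as the base of revision N the revision
you get by clearing N's lowest set bit, which caps the chain length at
roughly log2(N):

def skip_delta_base(revno):
    """Base revision for `revno` under skip-deltas: clear the lowest
    set bit.  Revision 0 is stored as a fulltext."""
    return None if revno == 0 else revno & (revno - 1)

def skip_delta_chain(revno):
    """The revisions whose records must be applied to rebuild `revno`,
    oldest first (starting from the fulltext at revision 0)."""
    chain = []
    while revno:
        chain.append(revno)
        revno &= revno - 1
    chain.append(0)
    return chain[::-1]

# skip_delta_chain(99) -> [0, 64, 96, 98, 99]: 4 deltas instead of 99.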



And one more thing I should say: xdelta3 is pretty darn good. It is
fast, works on binary data, and compresses well. But Mercurial's
bdiff/mpatch algorithm smokes it at extraction, especially since they
already have the "composing patches" logic worked out.

Some numbers...

using lh_child_child on 100 revisions of builtins.py to extract the tip
(1 fulltext + 15 deltas):

		extract time	~compress ratio
xdelta3		46ms  		405:1
bdiff		14ms		216:1
bdiff+composing	 5ms

linear on 100 revisions (1 fulltext + 98 deltas):
xdelta3		235ms		194:1
bdiff		 79ms		123:1
bdiff+composing	 26ms
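
For context on what "composing" buys: a bdiff/mpatch delta is
essentially a sorted list of (start, end, replacement) hunks against
the base text. Applying a chain of such deltas one at a time looks
roughly like the plain-Python sketch below; Mercurial's C mpatch gets
its extra speed by folding the whole chain of hunk lists into a single
hunk list before touching the text (the 5ms and 26ms rows above),
which this sketch does not attempt:

def apply_hunks(base, hunks):
    """Apply one delta: each (start, end, replacement) hunk replaces
    base[start:end]; hunks are sorted and non-overlapping."""
    out = []
    pos = 0
    for start, end, replacement in hunks:
        out.append(base[pos:start])   # unchanged text before the hunk
        out.append(replacement)
        pos = end
    out.append(base[pos:])            # unchanged tail
    return b''.join(out)

def apply_chain(fulltext, deltas):
    """Rebuild a text the slow, non-composed way: apply each delta's
    hunk list in turn."""
    text = fulltext
    for hunks in deltas:
        text = apply_hunks(text, hunks)
    return text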


So again, it would make sense to use xdelta3 for binary files or bundles
sent across the wire. But for a repository format where we want to
optimize for extraction speed, bdiff does a lot better.

I haven't looked at the VCDIFF format (xdelta3's delta encoding)
closely enough to figure out whether we could do better at extraction,
or whether we could work out a way to do "composing". But that is where
I'm at today.

Oh, and both of these smoke the hell out of patiencediff or difflib at
creating deltas. Creating a delta of the last revision against the
previous one:

xdelta3		6ms	222 bytes
bdiff		8ms	618 bytes
bdiff+zlib	8ms	291 bytes
difflib		105ms	618 bytes
difflib+gzip	105ms	296 bytes
patience	136ms	618 bytes
patience+gzip	136ms	296 bytes

So either way we go, we can do a heck of a lot better when generating
deltas against previous texts.
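
For completeness, here is roughly how a difflib-based delta in the
same (start, end, replacement) hunk shape can be generated. It is a
sketch for comparison only, not the code the numbers above were
measured with:

import difflib

def difflib_delta(old, new):
    """Turn difflib's opcodes into replacement hunks.  `old` and `new`
    can be any sequences (e.g. lists of lines); start/end index into
    `old`, and the replacement is the matching slice of `new`."""
    matcher = difflib.SequenceMatcher(None, old, new)
    hunks = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != 'equal':
            # 'replace', 'insert' and 'delete' all become "replace
            # old[a1:a2] with new[b1:b2]" in this model.
            hunks.append((a1, a2, new[b1:b2]))
    return hunks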


I have all of these numbers (and a hell of a lot more) in my
xdelta_test plugin. I've tried a couple of times now to crunch them
down into "summary" information so that others can understand them,
but there is just so much to analyze. I may get to the point where I
just send the damn report and stop trying to make it perfectly
understandable. People can install the plugin themselves and get their
own numbers.

http://bzr.arbash-meinel.com/plugins/xdelta_test

John
=:->


