gc vs multiparent-xdelta results

Fri Feb 20 22:01:15 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I went ahead and hacked together an xdelta repository implementation.
Thanks a lot to the groupcompress plugin, which had already put together
at least the basic structure for the VersionedFile and Repository
implementations.

To set it up, you first need to download and install my pyrex wrapper
around xdelta:
  bzr branch lp:~jameinel/+junk/xdelta3-pyrex
  cd xdelta3-pyrex
  python setup.py install

And then you can install this plugin:
  lp:~jameinel/+junk/bzr-xdelta-repo

At the moment, it doesn't support CHK repositories correctly, but I'm on
my way there.

What it *does* do is use xdelta3 and does a multi-parent compression.
Meaning it compresses the current text versus all parents. My past
testing showed that it worked fairly well, so I figured I'd give it a shot.

The current implementation doesn't cap the delta chains (which is a
little bad, but I was trying to see what max-compression would get us
to., and uses a recursive search for texts, so it certainly isn't
optimal yet. (On the plus side, because xdelta.decode() is in C, we
could actually make use of threading to handle extracting more than one
path along the parent texts.)

The other nice thing about multi-parent xdelta is that because texts
based on things in the past, you don't have to worry about ordering.

So from experience so far, it occasionally hangs for a long time,
because the parent texts aren't in the cache, and it has to unpack the
whole chain. I'm sure we could do a bit better, both by inserting
fulltexts via some heuristic, and by reading more than one hunk from
disk at a time.

I should also mention that I've tweaked groupcompress some more. Last
night I realized that one of the problems was that it was finding short
matches, which then "interrupted" the ability to find longer ones. So in
the current incarnation, I actually refuse to match anything shorter
than 10 bytes (it is likely to end up as a insert record anyway). I also
added the ability to do a "soft" compression. Which means that I don't
accept matches shorter than 200 bytes. And then I changed the GCVF
logic, so that when you start a new file_id, it sets the 'soft' flag. It
also maxes out at 8MB, and 4MB on a file-id change.

The ideas are:
1) You can still get inter-text compression, as long as the chunks that
are similar are reasonably long.
2) Instead of getting lots of "insert one line" or "copy 10 bytes", you
now get "copy 16 bytes", etc. I'm considering actually increasing that
value. (In looking at xdelta again, xdelta1 actually didn't match
anything < 32 bytes.)
3) If you are going to run out of space and start a new Group, you may
as well do it when starting a new file-id.
4) If you go past 10MB, then all of your references get an extra digit.
So I wanted the threshold to be just under 10MB. (xdelta3 uses an 8MB
page for exactly this reason, with a max of 16MB, and warns that making
it big causes the variable-bit-width references to grow.)
5) With these various changes I got back to my "best compression with
GC" which was around 11MB.

On to results (mysql-5.1 @-r525):

pack-1.9:
Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3949 KiB   0%      1201 KiB   8%     1043
Inventories:   842115 KiB  48%      1840 KiB  12%     1043
Texts:         882272 KiB  51%     11174 KiB  78%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        1728337 KiB 100%     14216 KiB 100%     9312

gc-chk255 with
                      Raw    %    Compressed    %  Objects
Revisions:       3990 KiB   0%       826 KiB   3%     1043
Inventories:    31012 KiB   3%     11819 KiB  49%    12090
Texts:         882272 KiB  96%     11301 KiB  47%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:         917275 KiB 100%     23947 KiB 100%    20359

multi-parent xdelta:
Commits: 1043
                      Raw    %    Compressed    %  Objects
Revisions:       3949 KiB   0%      1269 KiB  11%     1043
Inventories:   842115 KiB  48%       472 KiB   4%     1043
Texts:         882272 KiB  51%      9192 KiB  84%     7226
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        1728337 KiB 100%     10934 KiB 100%     9312

So the inventory size between gc and xdelta aren't comparable, as one is
in split-inventory form. But notice that we are down to <500KB or <4%
inventory overhead. My guess is that the unlimited delta chains are the
big win here (considering knits still use the sum(size(delta)) >
size(full), which means they insert fulltexts more often than every 200
sometimes).

On top of that, though, the total text compression is 9MB, which is
smaller than both knit and gc. Not by a lot, but smaller

Multiparent is also pretty darn good, though. Considering that most
merges are going to bring in the exact text from one parent, etc.

Probably my next thing to test with xdelta is going to be "compress
against all close-by ancestors, newer and older, and pick the smallest".
 Which is at least an approximation of the git algorithm. The current
code would be able to handle it, as long as the texts are already in the
repo.

I should also make a comment about timing. To create the gc+chk repo
took 9m45s (most of that is spent compressing the texts). To create the
xdelta repo only took ~3min. However, for extracting everything (via
'bzr repository-details') xdelta takes 1m46s, while groupcompress only
takes 18s. I believe xdelta gets a 'brain-fart' somewhere in the middle,
where the page cache gets a miss, and then has massive swapping as it
has to page in all of the parents in the chain.

(With other tunings of groupcompress, I was easily up to 30s to 1min to
extract everything, and before I optimized 'unordered' it was 7-10min.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmfKCoACgkQJdeBCYSNAAOdVwCfahOmv4ivX3l4UpJz7r7YZXKx
8ycAnipMlEoDJ/UVhJkAG4Y+HbnzhRtD
=3QSu
-----END PGP SIGNATURE-----