[brisbane:MERGE] Insert 'holes' in the DeltaIndex
John Arbash Meinel
john at arbash-meinel.com
Thu Mar 19 18:06:30 GMT 2009
The attached patch improves the compression speed of gc repositories by
a fair amount. It changes how the delta_index is defined. Originally,
the index was a packed array sorted by hash bucket. (So all of the
entries that fit in bucket 0 were stored linearly, immediately followed
by all the keys in bucket 1. An additional array of pointers was kept
that pointed at the first entry in each bucket.)
This patch changes it so that after we put the items in each bucket, we
pad it with (currently) 4 empty entries. This allows us to add a few
more entries without having to rebuild the whole object.
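To make the idea concrete, here is a minimal Python sketch of the layout (the real delta_index is C; the names `build_index`/`try_add` and the `None` holes are illustrative, not the actual implementation):

```python
HOLES_PER_BUCKET = 4  # the padding described above

def build_index(entries_by_bucket):
    """Pack buckets into one flat list, padding each with empty holes.

    Returns (flat, starts), where starts[i] is the offset of bucket i,
    mirroring the 'array of pointers at the first entry in each bucket'.
    """
    flat = []
    starts = []
    for bucket in entries_by_bucket:
        starts.append(len(flat))
        flat.extend(bucket)
        flat.extend([None] * HOLES_PER_BUCKET)  # room to grow in place
    return flat, starts

def try_add(flat, starts, bucket_no, entry):
    """Fill a hole in bucket_no; return False if the padding is used up
    (at which point the real code would rebuild the whole index)."""
    end = starts[bucket_no + 1] if bucket_no + 1 < len(starts) else len(flat)
    for i in range(starts[bucket_no], end):
        if flat[i] is None:
            flat[i] = entry
            return True
    return False
```

The point is that adding an entry is now O(bucket size) instead of O(index size), at the cost of a few wasted slots per bucket.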
It also revealed an interesting problem, which has now been fixed.
Specifically, the RABIN code creates an index entry for each 16-byte
range. However, the matching code only matches ranges that are at least
19 bytes long: it does the RABIN lookup, and then checks that at least 4
bytes are identical, starting at the last byte of the RABIN window.
Technically, a match could be shorter than 19 bytes if the RABIN hash
gets an exact 32-bit collision and the last character still matches, but
that is unlikely.
Anyway, if there was an insert of 16-19 bytes, we would create an index
entry for it, but would never match against it. And when that text was
repeated across 10+ content texts, we would get 10 inserts, but *also*
10 items in the delta index, which bloats the index and causes us to
check 10 possible matches that are too short to actually be used. (We
had a similar problem in the pre-RABIN code, which I also fixed with a
minimum match size.)
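A small sketch of the length floor described above (the constants match the description; `worth_indexing` is a hypothetical helper, not actual groupcompress code):

```python
RABIN_WINDOW = 16   # bytes hashed per index entry
MIN_EXTRA = 4       # bytes verified, starting at the window's last byte
MIN_MATCH = RABIN_WINDOW + MIN_EXTRA - 1   # the 19-byte floor

def worth_indexing(insert_len):
    """Inserts shorter than MIN_MATCH can never be matched later,
    so indexing them only bloats the delta index."""
    return insert_len >= MIN_MATCH
```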
I'm still curious as to whether we want to increase or decrease that
minimum match size. In my old testing, having a minimum size was
beneficial (10 was ~optimal, though I didn't play a lot with it). Having
a larger minimum size meant that you tend to get a single content
insert, followed by a bunch of copies of that content, rather than lots
of partial content inserts.
Put in concrete terms, a copy instruction averages about 6 bytes, so 2
copies is 12 bytes. If 10 texts each issue 2 copies, that is 120 bytes
of copy instructions. Alternatively, the first text could issue an
insert of XX bytes, and the remaining 9 texts each issue a single copy
of that range: 54 bytes of copy instructions + XX bytes of content.
XX = 66 bytes is the break-even point (insert 66 bytes + 9 six-byte copy
instructions == 120 bytes == 20 copy instructions). This, of course,
depends on how many times you would actually copy the content. It also
isn't perfectly true, because the zlib pass can remove a lot of the copy
redundancy.
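The arithmetic above, made explicit (assumed sizes only: ~6 bytes per copy instruction, insert header overhead ignored):

```python
COPY_COST = 6   # approximate bytes per copy instruction
N_TEXTS = 10

def two_copies_each(n=N_TEXTS):
    # Every text issues 2 copy instructions.
    return n * 2 * COPY_COST          # 10 * 2 * 6 = 120 bytes

def insert_once_then_copy(xx, n=N_TEXTS):
    # First text inserts xx literal bytes; the other 9 each issue
    # a single copy of that whole range.
    return xx + (n - 1) * COPY_COST   # xx + 54 bytes

# Break-even: xx + 54 == 120  =>  xx == 66
```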
Anyway, with this patch the time to do delta compression is greatly
improved. ('bzr pack' on mysql-525 is approx 30s=>~21s.) The breakdown
used to be 9s for 'make_delta', 5s for add_fulltext, and 5s for
add_deltas. It is down to about 6.6s for make_delta, 1.2s for add_source
and 0.3s for add_delta. On the Launchpad code it was 250s for
make_delta, 17s for add_delta, and 12s for add_fulltext. Even more
interesting is that we now spend more time in the btree_index code than
we do in the content compression code.
Though I do have the 'full compress' set to true, and we have 400k chk
nodes, which means the BTreeBuilder is spilling to disk (and we have a
small bug where spilling does so with the optimize flag on).
John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: brisbane-gc-delta-index.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090319/9c2c9f7a/attachment-0001.diff