Better compression

John Arbash Meinel john at arbash-meinel.com
Sat Jul 26 04:05:13 BST 2008


Robert Collins wrote:

|>
|> So I went ahead and hacked up some code to force reverse-order
|> insertion. There were a couple things I discovered:
|
| I have generic stuff for this partly done in a bzr branch; I'll publish
| that shortly.
|
|> 1) Changing the GenericRepoFetcher to use full-texts rather than using
|> an adapter was quite an improvement for small projects (like bzrtools).
|>
|> For specific results, doing a normal fetch using an adapter took 40s for
|> bzrtools. Using full-texts takes 6.8s. (That includes all the time to
|> compress the texts in the target.) Of course, branching into a pack-0.92
|> is only 1.8s.
|
| This is because knits are doing less string manipulation to get the
| fulltext out than extracting the fulltext from the in-progress
| compression group.
|

Sure, they already had the fulltext in line form. But truth be told, so
did we. So I added an LRUSizeCache capped at ~10MB of data. The idea is
that because we request texts in topological order, and we delta in
topological order, *most* of the time we will need the text we just
inserted.

In testing, I get really good results. It also revealed the need for my
other patch to stream Revisions and Signatures (because otherwise my
screen fills up with 1-by-1 revision requests).

These are the times for branching bzrtools into a gc repository:

Time	Stream Revisions	LRU Cache
37s	No			No
24s	Yes			No
20s	No			Yes
8s	Yes			Yes

Now, this is still going in forward order, requesting deltas and then
unpacking them using the source. I just shortcut common requests via the
LRUSizeCache. (Relative to the combined 8s time, streaming revisions
accounts for 12s of the savings, and the LRU cache for 16s.)
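The marginal savings implied by the timing table above can be checked
with a little arithmetic (the dictionary keying below is just an
illustration, keyed by "stream revisions?" and "LRU cache?"):

```python
# Branch times for bzrtools, in seconds, from the table above.
times = {("no", "no"): 37, ("yes", "no"): 24,
         ("no", "yes"): 20, ("yes", "yes"): 8}

# With the LRU cache already in place, streaming revisions saves:
print(times[("no", "yes")] - times[("yes", "yes")])  # → 12

# With streaming revisions already in place, the LRU cache saves:
print(times[("yes", "no")] - times[("yes", "yes")])  # → 16
```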

For bzrtools, the cache looks like:
     Texts Cache fill: 8530336, hit rate: [2046, 1442, 28]
 Inventory Cache fill: 8585591, hit rate: [1090, 979, 0]
 Revisions Cache fill: 786523,  hit rate: [1090, 0, 0]
Signatures Cache fill: 37900,   hit rate: [51, 0, 0]

The cache should fluctuate between 8MB and 10MB: when it hits 10MB I
flush down to 8MB, so that I'm not spending all my time in the 'flush'
code.
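The cap-and-flush behaviour can be sketched as a byte-size-capped LRU.
This is a minimal illustration under the description above, not the
actual bzrlib LRUSizeCache; the class name and internals here are
assumptions:

```python
from collections import OrderedDict

class SizeCappedLRU:
    """Sketch: evict least-recently-used entries once the total byte
    size passes max_size, flushing down to after_flush_size."""

    def __init__(self, max_size=10 * 1024 * 1024,
                 after_flush_size=8 * 1024 * 1024):
        self._cache = OrderedDict()   # oldest entries first
        self._total = 0
        self._max_size = max_size
        self._after_flush_size = after_flush_size

    def get(self, key):
        value = self._cache.pop(key)  # raises KeyError on a miss
        self._cache[key] = value      # re-insert as most recent
        return value

    def add(self, key, value):
        if key in self._cache:
            self._total -= len(self._cache.pop(key))
        self._cache[key] = value
        self._total += len(value)
        if self._total > self._max_size:
            # Flush down to the low-water mark, so the very next add
            # does not land us straight back in the flush code.
            while self._total > self._after_flush_size:
                _, old = self._cache.popitem(last=False)
                self._total -= len(old)
```

With a 10MB cap and an 8MB low-water mark, the total oscillates between
the two thresholds, as described above.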

So we insert 2046 texts into the cache and request 1442+28 of them back;
with an average of 8MB cached, we match 1442 of them (a 98% hit rate).
If I kept the cache at 10MB when flushing, there would be 3 more entries
that 'hit'.
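The 98% figure follows directly from the stats triple. Reading each
triple as [adds, hits, misses] is inferred from the surrounding text,
not from the bzrlib source:

```python
# Texts cache triple from the bzrtools run above: [adds, hits, misses]
adds, hits, misses = 2046, 1442, 28
hit_rate = hits / (hits + misses)
print(f"hit rate: {hit_rate:.0%}")  # → hit rate: 98%
```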

For bzr.dev, total time to branch with streaming revisions and a 10MB
LRU cache was 14m33s (down from 90min last night, though not as good as
the 7min with the hacked-up versions).

The cache rates were:
     Texts Cache fill: 10431293, hit rate: [40837, 34581, 2485]
 Inventory Cache fill: 8869491,  hit rate: [18675, 17040, 1043]
 Revisions Cache fill: 8609561,  hit rate: [18675, 0, 0]
Signatures Cache fill: 5776631,  hit rate: [7832, 0, 0]

The size went down a bit, 41MB => 35MB. My best guess is that we weren't
properly compacting the Revisions before, which is strange given what I
found with 'bzr log'.

Anyway, a 10MB cache gives us a 93% hit rate for bzr.dev texts, which
seems to make it more than worthwhile. I would consider tweaking it a
bit, to make the estimated size line up more closely with reality. I do
already account for the Python overhead of string objects, and the
overhead of storing lots of lines in a list (at least 4 bytes for a
pointer per line, and probably 20+ bytes per string of PyObject
overhead).
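That accounting can be approximated in the spirit described above: one
pointer slot in the list per line, plus each string object's own size.
This is a rough CPython-specific sketch (the function name is mine, and
the real LRUSizeCache may count differently); `sys.getsizeof` already
includes the PyObject header:

```python
import sys

def estimated_lines_size(lines):
    """Rough bytes held by a list of text lines: one list slot pointer
    per line plus the string object's size (header included)."""
    pointer_size = 8 if sys.maxsize > 2**32 else 4
    return sum(pointer_size + sys.getsizeof(line) for line in lines)
```

On modern 64-bit CPython the fixed per-line overhead is several tens of
bytes, so for short lines the overhead dominates the character data.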

The LRU cache is in my groupcompress branch, and the streaming-revisions
patch is up for review.

John
=:->



More information about the bazaar mailing list