[MERGE] Some tweaks for PythonGroupCompressor

John Arbash Meinel john at arbash-meinel.com
Thu Apr 23 00:29:00 BST 2009


This has a few tweaks to the Python group compressor code. The
headline result is that 'time bzr pack' on my mysql 1k tree now takes
4min, down from 30min (with the compiled extensions it takes 20s).

I split out some of the matching code, so it does a little bit better
(we have some issues with a max insert of 127 bytes, which conflicts
with only matching perfect lines, etc.)

The big difference is a change to how _get_longest_match works. It
turned out that we were spending a lot of time in:

  copy_ends = [loc + 1 for loc in locations]

This is because the line '\n' occurs 20k times or so, so that list
comprehension runs over a very large list. If you remove '\n' from the
possible matches, you end up artificially breaking up your match hunks:
the lines before and after the blank line both match, but you always
split them with an insert in the middle.

So the new code waits to do the increment until it knows that it will
use it, which was a modest win. It also tracks the locations as sets,
which is a bigger win, because it doesn't have to take a large list, put
it in a set, and then pull it out again.
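
To illustrate the idea, here is a minimal sketch of set-based matching
with a deferred increment. It is not the bzrlib code: build_line_index
and longest_match are hypothetical names, and the real
_get_longest_match may be structured quite differently. The point it
shows is that the '+ 1' is applied only while intersecting, and always
by walking the smaller of the two sets, so a very common line such as
'\n' never forces a full pass over all of its locations unless nothing
smaller is available.

```python
def build_line_index(lines):
    """Map each line's text to the set of positions where it occurs."""
    index = {}
    for pos, line in enumerate(lines):
        index.setdefault(line, set()).add(pos)
    return index


def longest_match(index, new_lines, start):
    """Find the longest run of new_lines[start:] present in the source.

    Returns (source_start, length), or (None, 0) if nothing matches.
    """
    candidates = None  # source positions of the most recently matched line
    length = 0
    for line in new_lines[start:]:
        locations = index.get(line)
        if not locations:
            break
        if candidates is None:
            # First line: reuse the indexed set directly, no copy needed.
            next_candidates = locations
        elif len(candidates) <= len(locations):
            # Shift only the surviving candidates forward by one.
            next_candidates = {loc + 1 for loc in candidates
                               if loc + 1 in locations}
        else:
            # Fewer locations than candidates: walk the locations instead.
            next_candidates = {loc for loc in locations
                               if loc - 1 in candidates}
        if not next_candidates:
            break
        candidates = next_candidates
        length += 1
    if length == 0:
        return None, 0
    last = min(candidates)  # prefer the earliest match in the source
    return last - length + 1, length


source = ["a\n", "\n", "b\n", "\n", "c\n", "d\n"]
new = ["\n", "c\n", "d\n", "x\n"]
index = build_line_index(source)
match = longest_match(index, new, 0)  # the run '\n', 'c\n', 'd\n'
```

Here the blank line matches at two source positions, but only the one
followed by 'c\n' survives the next step, so the blank line stays part
of the match instead of splitting it.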

At the moment, the pure python version is actually a *lot* worse for
delta compression. In my test, the final result is 19.6M versus 11M. My
guess is that it is the interaction of the 127-byte max insert size with
the 'only match complete lines' code. So if you have a 130 byte line in
a delta, it won't ever be used as a match.
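
A toy model of that guess, under stated assumptions: inserts are split
into pieces of at most 127 bytes, and only a piece that is exactly one
complete line is available for later matching. split_insert and
index_whole_lines are hypothetical helpers written for this
illustration, not functions from bzrlib, and the real indexing may
differ.

```python
MAX_INSERT = 127  # largest single insert, per the limit described above


def split_insert(data):
    """Split inserted bytes into pieces of at most MAX_INSERT bytes."""
    return [data[i:i + MAX_INSERT] for i in range(0, len(data), MAX_INSERT)]


def index_whole_lines(pieces):
    """Keep only pieces that consist of exactly one newline-terminated line.

    Assumption: this stands in for 'only match complete lines'.
    """
    return {p for p in pieces if p.endswith(b"\n") and p.count(b"\n") == 1}


long_line = b"x" * 129 + b"\n"   # 130 bytes, one over-long line
short_line = b"y" * 49 + b"\n"   # 50 bytes, fits in a single insert

long_pieces = split_insert(long_line)
long_index = index_whole_lines(long_pieces)
short_index = index_whole_lines(split_insert(short_line))
```

In this model the 130 byte line is stored as a 127 byte piece plus a
3 byte tail, so the original line never appears in the index as a
whole, while the 50 byte line does.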

I've considered changing the delta format so that we can have >127 byte
inserts. But I want to make sure we'll get a genuine benefit before
doing that. (That is also part of this patch: it adds a _dump method so
I can work out what is actually going on, etc.)

John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: groupcompress_info.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090422/c9faa4ab/attachment-0001.diff 

