[MERGE][1.0] update C PatienceDiff to allow for non-strings

John Arbash Meinel john at arbash-meinel.com
Tue Dec 4 16:31:52 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The existing C extension for patience diff was focused on just supporting
strings. It used a special hash function, and uses memcmp() to compare the strings.

Aaron pinged me on IRC to see if that was strictly required. I think he wanted
to use patience diff for comparing annotated texts. And being able to use
tuples there would be convenient.

So this patch allows for arbitrary Python Objects. As long as they support
PyObject_Hash (though I probably need to check the return value a bit better),
PyObject_Length, and PyObject_Compare.

I then wrote a simple command which allows you to extract a bunch of texts from
the repository, and then just run 'get_matching_blocks' for each version
against the previous version.

What I found was pretty startling.... Unless I've messed something up, it is
actually faster than it was before.

I did some testing on bzrlib/builtins.py, and found this:

getting versions ... 1623
getting texts ... 4581200 total bytes
patience_tuples: Total: 2.850s, min: 0.000ms avg: 0.548ms max: 10.000ms
bzr.dev:         Total: 3.870s, min: 0.000ms avg: 0.684ms max: 10.000ms

A few other bits of interest... In both cases the average diff time is fast
enough to be ~1ms, aka hard to count exactly.

In doing a bit more testing, it seems to be the hash function that made all the
difference. If I just change it so that 'hash' is a long, and I use
"PyObject_Hash" instead of the custom 'djb2' hash, the time is the same. (2.850s).

I don't know if the strings have already cached their hashes, or just if the
python hash function is just that much faster to compute. I'm wondering if it
computes it by long, rather than by character...

Anyway, by saving up the hashes and lengths we still get a faster initial
compare (using just PyObject_Compare slowed things down to >4s).

Anyway, I think this is worthy to merge into 1.0, it is a simple and
straightforward improvement to the diff algorithm.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHVYD4JdeBCYSNAAMRAtmfAKCtvWXFBv8UDhZ3yQpZyeh7R/VVaQCfaFaR
hMrj+o1fmAbYtbx73CfNGRQ=
=mKoB
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patience_tuples.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20071204/a717d3c8/attachment-0001.diff 


More information about the bazaar mailing list