[RFC] C implementation of PatienceSequenceMatcher
John Arbash Meinel
john at arbash-meinel.com
Sun Jul 22 16:12:31 BST 2007
Lukáš Lalinský wrote:
> Hi all,
>
> I wanted to learn about various diffing algorithms in detail, so I
> thought about implementing some of them in C/C++. Patience diff was the
> first one I tried. The initial version was a naive port of the Python
> code from bzrlib to C++ with Qt data containers, which turned out to be
> quite slow (actually slower than the Python version on large inputs). So
> I started wondering how to speed it up, and came up with the C version
> which is in the attachment. It uses the idea from Mercurial's bdiff of
> identifying unique lines based on their equivalence classes. This was a
> little faster than I expected (~7.5x faster than patiencediff.py, or ~5x
> faster than difflib on average input; in extreme cases it's 2.5x to 23x
> faster than patiencediff.py), so I decided to make a Python module out
> of it and try to integrate it into bzrlib. Here are some benchmarks
> (each result is the best one of ~5 runs):
...
> Bazaar
> `bzr commit` with _patiencediff_py.PatienceSequenceMatcher:
> real 0m4.018s
>
> `bzr commit` with _patiencediff_c.PatienceSequenceMatcher
> real 0m3.004s
...
> Linux
> `bzr diff` with _patiencediff_py.PatienceSequenceMatcher:
> real 1m56.096s
> `bzr diff` with _patiencediff_c.PatienceSequenceMatcher
> real 0m55.146s
> `bzr commit` with _patiencediff_py.PatienceSequenceMatcher
> real 6m19.922s
> user 2m32.650s
> sys 0m8.969s
>
> `bzr commit` with _patiencediff_c.PatienceSequenceMatcher
> real 5m23.293s
> user 1m25.357s
> sys 0m7.472s
>
> -----------------
>
> So, it doesn't have as significant an effect in bzrlib as it has on its
> own.
>
> The question is, does it make sense to replace the current
> PatienceSequenceMatcher with something like this (the attached code is
> just for illustration, it's not finished yet)? Are "not so awesome"
> performance improvements like this worth the price of including C source
> code in bzrlib?
>
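For anyone following along, the bdiff-style trick described above (map lines to equivalence classes, then align only on lines that are unique to both files) can be sketched in a few lines of Python. This is a hedged illustration with a made-up helper name, not the attached C code:

```python
def unique_common_lines(a, b):
    # Count how many times each line occurs in each sequence; the
    # dict keys effectively act as equivalence classes of equal lines.
    counts = {}
    for line in a:
        counts.setdefault(line, [0, 0])[0] += 1
    for line in b:
        counts.setdefault(line, [0, 0])[1] += 1
    # Record a position in b for each line (unique lines have only one).
    pos_b = {}
    for i, line in enumerate(b):
        pos_b[line] = i
    # Keep only lines that occur exactly once in both sequences;
    # patience diff uses these pairs as its matching anchors.
    return [(i, pos_b[line])
            for i, line in enumerate(a)
            if counts[line] == [1, 1]]
```

Running this over two file bodies gives the anchor pairs that the longest-increasing-subsequence step of patience diff then lines up.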
I think it has a bit more effect than you are giving it credit for, and as we
improve other places, the effect will be even bigger.
Specifically:
bzr commit drops from 4s to 3s, which is pretty darn good.
bzr diff drops from 1m56s to 55s, roughly 2x as fast.
Now, in both cases, probably the bigger expense is our Knit extraction code
(pulling data out of knits is one of the next big bottlenecks).
My feelings on this:
1) You've already written the code, and hopefully it is well tested. You may
not have hooked it in yet, but you can use some of my tricks to run the
PatienceDiff test suite against both the Python implementation and the C
implementation.
If you have the test coverage, the code is written, and it *does* help, it
seems worthy of inclusion to me.
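The "tricks" here are just test parameterization: build one TestCase subclass per available matcher class and load them all into one suite, so every test runs against every implementation. A rough, hypothetical sketch (shown with difflib.SequenceMatcher standing in for the two PatienceSequenceMatcher implementations, which share its constructor and get_matching_blocks interface):

```python
import unittest

class _MatcherTests(object):
    # The matcher class under test is injected as a class attribute.
    matcher_class = None

    def test_identical_sequences(self):
        m = self.matcher_class(None, ['a', 'b'], ['a', 'b'])
        first = m.get_matching_blocks()[0]
        # Both sequences should match starting at position 0.
        self.assertEqual((first[0], first[1]), (0, 0))

def make_matcher_suite(matcher_classes):
    # Build one concrete TestCase subclass per implementation, so a
    # failure report names which implementation broke.
    suite = unittest.TestSuite()
    loader = unittest.defaultTestLoader
    for klass in matcher_classes:
        case = type('Test_%s' % klass.__name__,
                    (_MatcherTests, unittest.TestCase),
                    {'matcher_class': klass})
        suite.addTests(loader.loadTestsFromTestCase(case))
    return suite
```

With both modules importable, you would pass in the Python and C PatienceSequenceMatcher classes, guarding the C import with try/except ImportError so the suite still runs where the extension isn't compiled.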
2) As other parts of our code get faster, the diff algorithm becomes more of a
bottleneck. I believe David Allouche has seen diff being one of the biggest
bottlenecks for conversion scripts. (Because they commit over and over again.)
3) We have generally chosen to use Pyrex for extensions, in the hope that it
might be more maintainable (some of the code can be written in a more
'pythonic' fashion, and it is certainly more natural to use classes, etc.).
However, if this is a fairly clean C implementation, I don't think we would
have any reason not to use it.
I haven't actually reviewed the code, I just thought I would give my feelings
about the concept.
John
=:->
More information about the bazaar mailing list