[RFC] C implementation of PatienceSequenceMatcher

Sun Jul 22 15:51:52 BST 2007

Hi all,

I wanted to learn about various diffing algorithms in detail, so I
though about implementing some of them in C/C++. Patience diff was the
first one I tried. Initial version was a naive port of the Python code
from bzrlib to C++ with Qt data containers, which turned out to be quite
slow (actually slower than the Python version, on large inputs). So I
started wondering how to speed it up, and came up with the C version
which is in the attachment. It uses the idea from Mercurial's bdiff to
identify unique lines based on their equivality classes. This was little
faster than I expected (~7,5x faster than patiencediff.py or ~5x faster
than difflib on average input, in extreme cases it's in 2,5x - 23x
faster than patiencediff.py), so I decided to make a Python module out
of it and try to integrate it to bzrlib. Here are some benchmarks (each
result is the best one of ~5 runs):

-----------------

Tested on bzr with `bzr revert -r -50`:

       479 unchanged
       177 modified
        17 removed
         1 unknown
       463 ignored
        49 versioned subdirectories

`bzr diff` with _patiencediff_py.PatienceSequenceMatcher:
real    0m3.275s
user    0m2.996s
sys     0m0.276s

`bzr diff` with _patiencediff_c.PatienceSequenceMatcher
real    0m2.273s
user    0m2.084s
sys     0m0.188s

`bzr commit` with _patiencediff_py.PatienceSequenceMatcher:
real    0m4.018s
user    0m3.680s
sys     0m0.328s

`bzr commit` with _patiencediff_c.PatienceSequenceMatcher
real    0m3.004s
user    0m2.772s
sys     0m0.236s

On Linux 2.6.1 patched to 2.6.2 (ok, this is somewhat unrealistic
use-case, but at least on real data):

     16378 unchanged
      6201 modified
       319 removed
       500 unknown
      1284 versioned subdirectories

`bzr diff` with _patiencediff_py.PatienceSequenceMatcher:
real    1m56.096s
user    1m49.347s
sys     0m5.472s

`bzr diff` with difflib.SequenceMatcher:
real    1m41.700s
user    1m31.974s
sys     0m4.700s

`bzr diff` with _patiencediff_c.PatienceSequenceMatcher
real    0m55.146s
user    0m51.091s
sys     0m3.592s

`bzr commit` with _patiencediff_py.PatienceSequenceMatcher
real    6m19.922s
user    2m32.650s
sys     0m8.969s

`bzr commit` with _patiencediff_c.PatienceSequenceMatcher
real    5m23.293s
user    1m25.357s
sys     0m7.472s

-----------------

So, it doesn't have as significant effect in bzrlib as it has on it's
own.

The question is, does it make sense to replace the current
PatienceSequenceMatcher with something like this (the attached code is
just for illustration, it's not finished yet)? Are "not so awesome"
performance improvements like this worth the price of including C source
code in bzrlib?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bzr-cpatiencediff.diff
Type: text/x-patch
Size: 64612 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070722/54ffbee0/attachment-0001.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Toto je =?ISO-8859-1?Q?digit=E1lne?=
	=?ISO-8859-1?Q?_podp=EDsan=E1?= =?UTF-8?Q?_=C4=8Das=C5=A5?=
	=?ISO-8859-1?Q?_spr=E1vy?=
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070722/54ffbee0/attachment-0001.pgp