[RFC] C implementation of PatienceSequenceMatcher
Lukáš Lalinský
lalinsky at gmail.com
Sun Jul 22 15:51:52 BST 2007
Hi all,
I wanted to learn about various diffing algorithms in detail, so I
though about implementing some of them in C/C++. Patience diff was the
first one I tried. Initial version was a naive port of the Python code
from bzrlib to C++ with Qt data containers, which turned out to be quite
slow (actually slower than the Python version, on large inputs). So I
started wondering how to speed it up, and came up with the C version
which is in the attachment. It uses the idea from Mercurial's bdiff to
identify unique lines based on their equivality classes. This was little
faster than I expected (~7,5x faster than patiencediff.py or ~5x faster
than difflib on average input, in extreme cases it's in 2,5x - 23x
faster than patiencediff.py), so I decided to make a Python module out
of it and try to integrate it to bzrlib. Here are some benchmarks (each
result is the best one of ~5 runs):
-----------------
Tested on bzr with `bzr revert -r -50`:
479 unchanged
177 modified
17 removed
1 unknown
463 ignored
49 versioned subdirectories
`bzr diff` with _patiencediff_py.PatienceSequenceMatcher:
real 0m3.275s
user 0m2.996s
sys 0m0.276s
`bzr diff` with _patiencediff_c.PatienceSequenceMatcher
real 0m2.273s
user 0m2.084s
sys 0m0.188s
`bzr commit` with _patiencediff_py.PatienceSequenceMatcher:
real 0m4.018s
user 0m3.680s
sys 0m0.328s
`bzr commit` with _patiencediff_c.PatienceSequenceMatcher
real 0m3.004s
user 0m2.772s
sys 0m0.236s
On Linux 2.6.1 patched to 2.6.2 (ok, this is somewhat unrealistic
use-case, but at least on real data):
16378 unchanged
6201 modified
319 removed
500 unknown
1284 versioned subdirectories
`bzr diff` with _patiencediff_py.PatienceSequenceMatcher:
real 1m56.096s
user 1m49.347s
sys 0m5.472s
`bzr diff` with difflib.SequenceMatcher:
real 1m41.700s
user 1m31.974s
sys 0m4.700s
`bzr diff` with _patiencediff_c.PatienceSequenceMatcher
real 0m55.146s
user 0m51.091s
sys 0m3.592s
`bzr commit` with _patiencediff_py.PatienceSequenceMatcher
real 6m19.922s
user 2m32.650s
sys 0m8.969s
`bzr commit` with _patiencediff_c.PatienceSequenceMatcher
real 5m23.293s
user 1m25.357s
sys 0m7.472s
-----------------
So, it doesn't have as significant effect in bzrlib as it has on it's
own.
The question is, does it make sense to replace the current
PatienceSequenceMatcher with something like this (the attached code is
just for illustration, it's not finished yet)? Are "not so awesome"
performance improvements like this worth the price of including C source
code in bzrlib?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bzr-cpatiencediff.diff
Type: text/x-patch
Size: 64612 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070722/54ffbee0/attachment-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Toto je =?ISO-8859-1?Q?digit=E1lne?=
=?ISO-8859-1?Q?_podp=EDsan=E1?= =?UTF-8?Q?_=C4=8Das=C5=A5?=
=?ISO-8859-1?Q?_spr=E1vy?=
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070722/54ffbee0/attachment-0001.pgp
More information about the bazaar
mailing list