Tweaking PatienceDiff

Sun Apr 11 11:03:56 BST 2010

John Arbash Meinel <john <at> arbash-meinel.com> writes:
> http://bramcohen.livejournal.com/73318.html
> 
> I believe our current code does the recurse first, rather than matching
> the start and end first if they are identical.
> 
> The major difference is that for mostly-the-same-files you'll skip a lot
> of the internal "build a dict" stuff, for all of those lines that line
> up perfectly.
> 
> Something to think about.

I've experimented with this a little this morning. After modifying the the
Python version to first match lines at the start and end and only run LCS on the
rest, 'bzr log -p' on one of my projects went down from ~1:36 to ~1:28. I'm not
sure what impact would it have on the C version.

The output of the algorithm is not the same though. There is actually one test
case that breaks after changing the code. If you have two strings
'abcdefghijklmnop' and 'abcXghiYZQRSTUVWXYZijklmnop', the original version which
does a LCS first results in these matches:

abcdefghi             jklmnop
abcX  ghiYZQRSTUVWXYZijklmnop

The modified version matches these characters:

abcdefgh             ijklmnop
abcX  ghiYZQRSTUVWXYZijklmnop

I think I actually like the result of the modified one better, but I was not
able to come up with a real-world case where it would make a difference.

Lukas