[RFC][merge] Patience Sorting Diffs

Mon Dec 12 17:24:29 GMT 2005

Back at UBZ I put together a branch for handling Codeville's Patience
Sorting algorithm.
http://revctrl.org/PreciseCodevilleMerge

My branch is available here:
http://bzr.arbash-meinel.com/branches/bzr/cdv-diff/

It isn't the full implementation of their merging algorithm, just using
their diffing operation.

It is computationally similar to the current difflib. It takes slightly
longer, but not terribly so. The real advantage is that it does give
slightly better diffs, especially when you copy text around.

If I understand it correctly, Python's difflib uses a longest common
string matching, so it compares 2 texts to find the longest matching
string, then finds the next longest, etc. This is reasonably close to
what "diff" uses, and gives decent results.
However, it really starts to mess up when you have common lines, like
C++ open/close curly braces, etc. It also messes up when you create a
new function by copying an old one, and only changing a couple of lines.
As it will often attribute the copied lines as being added in the old
function, rather than in the new one.

The PatienceSorting algorithm that Codeville uses has a different
approach. They first match only unique lines in the 2 texts (so they
ignore curly brace lines, empty lines, etc). They then use the Patience
Sorting algorithm to match the longest common subset (not string). And
then they fill in the matches inbetween the unique lines.

The specific implementation for CDV had what I consider to be a bug,
where non-unique lines, surrounded by unmatched lines would not be
matched. I fixed that behavior by resorting back to longest common
substring inbetween the unique matches.
To give an example:

AbcDbcE
FbcGbcH

In the above, it is pretty obvious that the 'bc' sections should match.
You run into some problems if you had something like:
AbcDbcEbcI
FbcGbcH

Because which section should match? Codeville will say that nothing
matches, I think just trying to match something is sufficient.

Anyway, I updated the codebase so that we can plug in different
SequenceMatcher objects. I also added a main() function, so that you can
run cdvdifflib as a command line application.

I did run a test which upgraded an old v4 bzr.dev archive to the new
weave format, using cdv diff instead of difflib. And 'bzr check' works.
I'm pretty positive that I generate valid diffs (since I mostly just use
difflibs builtin code to generate them). They may not be optimal, but I
think they are better than difflib.

I would like to see this code merged in sometime, but I would first like
people to play with cdvdiff and see what they think.
(The command to run is ./bzrlib/cdv/cdvdifflib.py file_a file_b, though
the builtin 'bzr diff' has also been updated).

John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 256 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051212/b17e9be7/attachment.pgp