Longest Common Subsequences code

John Arbash Meinel john at arbash-meinel.com
Wed Nov 1 16:29:21 GMT 2006


Cheuksan Edward Wang wrote:
> 
> 
> On 11/1/06, *John Arbash Meinel* <john at arbash-meinel.com
> <mailto:john at arbash-meinel.com>> wrote:
> 
> 
> 
>     I think it would be possible to adjust patience diff to handle when
>     there are no unique lines. I'm not sure what you mean by 'unique'
>     though. Just that every line has a pair in the file? Like:
> 
> 
> Unique lines means those lines appearing exactly once in each file.
>  

Well, I'm pretty aware of the patience diff algorithm, since I reviewed it
and optimized it a little bit to get it merged into bzr. But it takes a
little while to swap the details back in, and I wanted to make sure we were
on the same page.

> 
>     A  A
>     B  B
>     A  A
>     B  B
> 
>     IIRC this will do 1 recurse and find no unique matching lines, and then
>     start from the top and match every line in a linear search.
> 
> 
> yes
>  
> 
>     This is a little more interesting:
> 
>     A A
>     B B
>     A C
>     B D
>       C
>       D
>       A
>       B
> 
> 
>     Which will do a similar thing, and then use the linear search to match
>     the top and bottom.
> 
>     I'm pretty sure it will "mess up" with this one:
>     A A
>     B B
>     A C
>     B D
>       A
>       B
>       C
>       D
> 
>     It should find the first AB match, but probably won't find the second AB
>     match. I'm willing to sacrifice these sorts of things because with
>     non-unique lines you quickly get into ambiguities anyway. Also,
>     non-unique lines aren't usually the "interesting" lines anyway. They are
>     the plain "else:" or blank line.
> 
>     Anyway, as I remember it, if there are absolutely no unique lines, I'm
>     pretty sure patience just falls back on a simple linear search, so it
>     should have relatively good scaling. Because it just compares the next
>     line in each case, I'm pretty sure it is approx O(N), since it doesn't
>     do a many-to-many comparison.
> 
> 
> The expected time is linear. The worst case is O(mn) where m and n are
> sizes of the 2 files. It happens when recurse_matches calls itself after
> finding, say, only one matching line at the end. The other interesting
> thing to think about is: What if there are no unique lines and the
> beginnings and ends don't match?

Well, I can already say that it won't find the ones in the middle.

At one point, I had modified patiencediff to use difflib for the interior
loops, so that it could find matches in cases like:

A A
B E
C C
C C
B E
D D

In the above situation, it will find AA and DD, but it will miss the CC
matches, because they aren't unique, and they don't touch matching lines.
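
Roughly, the idea was something like this (a sketch only, not the code that
was actually in bzr; the anchor positions are hard-coded here for
illustration):

# Sketch of the "use difflib for the interior" idea; not the code that was
# in bzr.  The anchors (the already-matched A and D) are hard-coded.
import difflib

left  = ['A', 'B', 'C', 'C', 'B', 'D']
right = ['A', 'E', 'C', 'C', 'E', 'D']

# Suppose the unique-line pass already matched A (index 0) and D (index 5);
# fall back to difflib only on the interior region between those anchors.
inner = difflib.SequenceMatcher(None, left[1:5], right[1:5])
for block in inner.get_matching_blocks():
    if block.size:
        # offsets are relative to the interior slice, so shift them back
        print('%s matches at %d <-> %d' %
              (left[1 + block.a:1 + block.a + block.size],
               1 + block.a, 1 + block.b))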

We took that fallback out because it turned the inner section into N^3
again, and large files with no common lines seemed to trigger that N^3
behavior.

Also patiencediff will handle:

A A
B E
C C
C C
A A
B E
D D

Because the first pass will match the leading A and the unique D at the end,
and then the inner recursion will find that the second A is unique in that
context (and the CC lines then attach to that match).
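
To make that concrete, here is a rough sketch of the unique-line step. It is
written from scratch for illustration, not copied from bzrlib's
patiencediff.py; the function name is made up, and the real code also does
the leading/trailing common-line matching and recursion that this sketch
leaves out.

# Sketch (not the real bzrlib code): pair up lines that occur exactly once
# on both sides, then keep the longest increasing run of those pairs.
import bisect

def unique_lcs_sketch(a, b):
    def unique_index(lines):
        # map each line to its index if it occurs exactly once, else None
        index = {}
        for i, line in enumerate(lines):
            index[line] = None if line in index else i
        return index

    in_a, in_b = unique_index(a), unique_index(b)
    # candidate (a_index, b_index) pairs for lines unique to both sides,
    # ordered by their position in a
    pairs = sorted(
        (i, in_b[line])
        for line, i in in_a.items()
        if i is not None and in_b.get(line) is not None)

    # keep the longest strictly increasing run of b-indices (the "patience
    # sorting" step), done here with a plain bisect-based LIS for brevity
    piles = []   # smallest ending b-index of each pile
    links = []   # (pair, index of predecessor link) for reconstruction
    tops = []    # index into links of the current top of each pile
    for pair in pairs:
        k = bisect.bisect_left(piles, pair[1])
        links.append((pair, tops[k - 1] if k else None))
        if k == len(piles):
            piles.append(pair[1])
            tops.append(len(links) - 1)
        else:
            piles[k] = pair[1]
            tops[k] = len(links) - 1
    result, node = [], (tops[-1] if tops else None)
    while node is not None:
        pair, node = links[node]
        result.append(pair)
    return result[::-1]

left  = ['A', 'B', 'C', 'C', 'A', 'B', 'D']
right = ['A', 'E', 'C', 'C', 'A', 'E', 'D']
print(unique_lcs_sketch(left, right))            # whole files: only D, [(6, 6)]
print(unique_lcs_sketch(left[1:6], right[1:6]))  # inner region: A, [(3, 3)]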

>  
> The Hirschberg algorithm and unix diff (Hunt and McIlroy) both produce
> minimal diffs. They have different characteristics. If necessary (or if
> someone is really interested), we can try them.
> 

Well, when you have non-unique lines, minimal is much harder to
establish, because a given section could go either way.
Also, Aaron touched on the fact that we don't really care about minimal
diffs. The properties we would like are:

1) Delta compression of full-texts. Minimal diffs would make this
better, but I would guess only marginally so. ATM we are gzip
compressing the final hunks anyway. The most important thing is that a
change of 2 lines to a 100K-line source file should produce a ~2-line
delta, not 100K lines. And I think all the algorithms will give us that.

2) Good performance, on both small and large N. We don't want an O(N^3)
algorithm. But we should keep in mind that most of our source is in the
1-10K line range, and at those sizes constant factors can easily dominate.
I wouldn't want to see us tuning for 100K-line files and losing
performance on small files.

3) Human-friendly deltas. When you run "bzr diff" you generally mean to
review your changes, and possibly send them to someone else (doubly so
if you are using "bzr bundle"). That means having the absolute minimum
diff isn't always better than having a larger diff that is more
understandable.
Patience diff is frequently a winner here. One of the most common cases
is code that has been copied and pasted from an earlier function, where
difflib makes it look like the old function was changed. Something like:

 def original():
+  a line of code
+  another line of code
+
+def copy():
   a line of code
   another line of code

Patience diff isn't always better, but it does tend to attach hunks like that as:
 def original():
   a line of code
   another line of code

+def copy():
+  a line of code
+  another line of code

Now, an algorithm can never be perfect about this sort of thing. But
patience diff matches the unique lines first and then attaches the
non-unique ones to them, so it does pretty well here.

There are a few places where it still falls a little flat, like with:
"    else:"

Especially with Python code, you can accidentally have a uniquely
indented 'else:' line when you re-indent a section of text.

One of the things we discussed was to have "uniqueness" be determined
after removing whitespace. This helps prevent spurious matches, though
it makes more lines non-unique, which may hurt overall matching.
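
A sketch of that idea (the helper name is made up; nothing like it exists
in bzrlib today): decide "uniqueness" on whitespace-stripped copies of the
lines, while still reporting matches in terms of the original line indices.

# Sketch only: uniqueness is judged on whitespace-stripped lines, but the
# resulting pairs still refer to the original indices.
def unique_pairs_ignoring_whitespace(a, b):
    def buckets(lines):
        seen = {}
        for i, line in enumerate(lines):
            key = ''.join(line.split())   # drop all whitespace
            seen[key] = None if key in seen else i
        return seen
    in_a, in_b = buckets(a), buckets(b)
    return sorted(
        (i, in_b[key])
        for key, i in in_a.items()
        if i is not None and in_b.get(key) is not None)

# In a, the two differently indented "else:" lines collapse to the same key,
# so "else:" is no longer unique there and no spurious pair is generated:
a = ['if x:', '    else:', 'common', 'if y:', '  else:']
b = ['if x:', '  else:', 'common']
print(unique_pairs_ignoring_whitespace(a, b))    # -> [(0, 0), (2, 2)]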

Another possibility is that we could look into a sub-line delta
algorithm. Certainly you could get more delta compression from it. And
sometimes that sort of annotation is nicer to read (like vimdiff, which
shows you which lines changed and then which words changed within a given
almost-matching line, or ndiff).
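
A crude sketch of the sub-line idea, using only stdlib difflib on the words
within a pair of almost-matching lines (the [-...-]/{+...+} markup below is
ad hoc, purely for illustration):

# Word-level (sub-line) delta between two almost-matching lines.
import difflib

def word_delta(old_line, new_line):
    old_words = old_line.split()
    new_words = new_line.split()
    sm = difflib.SequenceMatcher(None, old_words, new_words)
    parts = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            parts.append(' '.join(old_words[i1:i2]))
        else:
            if i1 != i2:
                parts.append('[-%s-]' % ' '.join(old_words[i1:i2]))
            if j1 != j2:
                parts.append('{+%s+}' % ' '.join(new_words[j1:j2]))
    return ' '.join(parts)

print(word_delta('for line in old_lines:',
                 'for line in new_lines:'))
# -> for line in [-old_lines:-] {+new_lines:+}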

A lot of our code is a little too line-based IMO, and it could stand to
use more hunk-based ideas (converting them back to line-based for
annotate, for example).

Though because diff algorithms are frequently line-based, you may run into
issues with splitting and joining too often.

John
=:->
