Longest Common Subsequences code

John Arbash Meinel john at arbash-meinel.com
Tue Oct 31 17:48:34 GMT 2006


Cheuksan Edward Wang wrote:
> 
> I'm looking for python code that implements Longest Common Subsequences
> efficiently. Ideally, it should just implement the get_matching_blocks
> method in SequenceMatcher in quadratic time and linear space. i.e.
> Hirschberg's algorithm. Does any other python VCS have it? If anyone can
> give me some pointers to some code, it can save me some time.
> 
> At the moment, the python difflib is using cubic time. Patience diff
> doesn't have optimality guarantees if there are no unique lines.
> 
> Thanks
> 
> Cheuksan Edward Wang

I don't know of any offhand.
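For what it's worth, the linear-space core of Hirschberg's algorithm
(computing one row of the LCS-length table at a time) is only a few
lines. A rough textbook sketch, not taken from any existing VCS:

```python
def lcs_length_last_row(a, b):
    # Compute the final row of the LCS dynamic-programming table,
    # keeping only two rows alive at once:
    # O(len(a) * len(b)) time, O(len(b)) space.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b):
            if x == y:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(curr[j], prev[j + 1]))
        prev = curr
    return prev
```

Hirschberg's full algorithm then splits `a` in half, runs this forward
on the top half and backward (on reversed inputs) for the bottom half to
find the best split point in `b`, and recurses, so it recovers the
actual subsequence while still using linear space.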

I think it would be possible to adjust patience diff to handle the case
where there are no unique lines. I'm not sure what you mean by 'unique',
though. Just that every line has a pair in the file? Like:

A  A
B  B
A  A
B  B

IIRC this will do one recursion, find no unique matching lines, and then
start from the top and match every line in a linear search.

This is a little more interesting:

A A
B B
A C
B D
  C
  D
  A
  B


This will do a similar thing, and then use the linear search to match
the top and bottom.
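That top-and-bottom matching is just a common-prefix/common-suffix scan.
A minimal sketch of my reading of the behavior (not the actual bzrlib
code):

```python
def match_ends(a, b):
    # Count matching lines at the top (common prefix) and at the
    # bottom (common suffix), taking care not to overlap the prefix.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    j = 0
    while j < len(a) - i and j < len(b) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    return i, j  # lines matched at the top, lines matched at the bottom
```

For the example above it would pair the leading A, B and the trailing
A, B, leaving the middle lines for further matching.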

I'm pretty sure it will "mess up" with this one:
A A
B B
A C
B D
  A
  B
  C
  D

It should find the first AB match, but probably won't find the second AB
match. I'm willing to sacrifice these sorts of things, because with
non-unique lines you quickly get into ambiguities anyway. Also,
non-unique lines usually aren't the "interesting" lines; they are the
plain "else:" or blank lines.

Anyway, as I remember it, if there are absolutely no unique lines,
patience just falls back on a simple linear search, so it should scale
relatively well. Because it just compares the next line on each side,
rather than doing a many-to-many comparison, it is approximately O(N).

The biggest problem with Patience right now is that I think the
constant factors are much bigger than difflib's, so for "well behaved"
files it is a little bit slower.

Limiting recursion, and possibly caching some of the information we
have just determined, might help with some of that.

Also, doing things as a queue rather than a recursive stack should make
it faster in Python. If we can define it in terms of generator functions
rather than recursive function calls, that could also help. (A generator
doesn't need to set up its stack frame on each re-entry, so it saves a
bit of overhead.)
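A hypothetical sketch of the queue idea (the `find_matches` callback
and the region tuples are interfaces I'm assuming here, not bzr's
actual ones):

```python
from collections import deque

def diff_regions(a, b, find_matches):
    # Drive the divide-and-conquer with an explicit worklist of
    # (alo, ahi, blo, bhi) regions instead of recursive calls.
    results = []
    work = deque([(0, len(a), 0, len(b))])
    while work:
        alo, ahi, blo, bhi = work.popleft()
        # find_matches returns matches found in this region plus any
        # sub-regions that still need to be examined.
        matches, subregions = find_matches(a, b, alo, ahi, blo, bhi)
        results.extend(matches)
        work.extend(subregions)
    return sorted(results)
```

Each iteration does the work one recursive call would have done, but
without growing the interpreter's call stack.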

John
=:->
