VersionedFile.walk deprecated?

Fri Apr 21 18:55:13 BST 2006

On Friday, 2006-04-21 Martin Pool wrote:

> On 20 Apr 2006, John Yates <jyates at netezza.com> wrote:
> > The matching algorithm repeatedly eliminates maximal regions of
> > identical text from both files.  The work list invariant is that for any
> > line on the list there remains in each file exactly one "unmatched"
> > instance.  That invariant is incrementally maintained during the growth
> > of a matched region.  I fail to see how the order of selecting lines
> > from the work list can alter the final result.
>
> suppose you are comparing
>
>   1x2y3z4
>   1 2x3y4z
>
> [1234xyz] would all be on the work list; it would be plausible to choose
> either sequence as common but you get different results either way.
> We try to look for the longest subsequence to get the best result.

Agreed that at the outset all lines are on the work list.  This example
is slightly degenerate in that there are no common regions larger than a
single line.  Nonetheless, at termination the algorithm I wrote-up would
produce the following set of correspondences:

     1 -----------> 1
     x ------\
     2 -----------> 2
     y ----\   \--> x
     3 -----------> 3
     z --\   \----> y
     4 -----------> 4
           \------> z

If an edit is encoded as a single cursor walking over the input copying
or skipping lines and / or inserting new lines then the before sequence
can be transformed into the after sequence as:

     Copy 1          # = 1
     Skip 1          # - x
     Copy 1          # = 2
     Skip 1          # - y
     Insert 1 "x"    # + x
     Copy 1          # = 3
     Skip 1          # - z
     Insert 1 "y"    # + y
     Copy 1          # = 4
     Insert 1 "z"    # + z

If the edit is encoded as a stream of arbitrary references into the
source sequence plus insertions of new lines then the before sequence
can be transformed into the after sequence as:

     Copy 1@"1"
     Copy 1@"2"
     Copy 1@"x"
     Copy 1@"3"
     Copy 1@"y"
     Copy 1@"4"
     Copy 1@"z"

Notice that this encoding correctly captures the fact that no unfamiliar
lines have been introduced.  It would take a bit more analysis to
recover the fact that no source line was dropped and that the result is
a strict permutation.

When I have a bit more time I will construct and send along an example
where residual regions remain at the end of the matching algorithm.  My
sense is that that is where the longest common subsequence heuristic may
become interesting.

/john