[MERGE] Patience diff

Wed May 24 02:16:04 BST 2006

On 23 May 2006, John A Meinel <john at arbash-meinel.com> wrote:
> Aaron Bentley wrote:
> > I checked the latest on revctrl.org.  It seems our version is the
> > newest.  Actually, someone added a bunch of epydoc-style comments to our
> > version, so it's not a 1-1 match anymore.
> 
> Actually, that would probably be me. :) I went through and tried to
> document what the different parameters were, just so I could follow what
> was going on.

I like that, bram's code is a bit taciturn.

> > How do you like me now?

Enormously.  +1 with some comments.

> +def main(args):
> +    import optparse
> +    p = optparse.OptionParser(usage='%prog [options] file_a file_b'
> +                                    '\nFiles can be "-" to read from
> stdin')
> +    p.add_option('--cdv', dest='matcher', action='store_const',
> const='cdv',
> +                 default='cdv', help='Use the cdv difference algorithm')
> +    p.add_option('--difflib', dest='matcher', action='store_const',
> const='difflib',
> +                 default='cdv', help='Use python\'s difflib algorithm')
> +
> +    algorithms = {'cdv':SequenceMatcher, 'difflib':difflib.SequenceMatcher}

We might as well call s/cdv/patience/ consistently.

> +    def test_multiple_ranges(self):
> +        # There was an earlier bug where we used a bad set of ranges,
> +        # this triggers that specific bug, to make sure it doesn't regress
> +        def chk_blocks(a, b, matching):
> +            # difflib always adds a signature of the total
> +            # length, with no matching entries at the end
> +            s = SequenceMatcher(None, a, b)
> +            blocks = s.get_matching_blocks()
> +            x = blocks.pop()
> +            self.assertEquals(x, (len(a), len(b), 0))
> +            self.assertEquals(matching, blocks)
> +
> +        chk_blocks('abcdefghijklmnop'
> +                 , 'abcXghiYZQRSTUVWXYZijklmnop'
> +                 , [(0, 0, 3), (6, 4, 3), (9, 20, 7)])
> +
> +        chk_blocks('ABCd efghIjk  L'
> +                 , 'AxyzBCn mo pqrstuvwI1 2  L'
> +                 , [(0,0,1), (1, 4, 2), (4, 7, 1), (9, 19, 1), (12, 23,
> 3)])
> +
> 
> Is this the best way to write it? Basically, I needed a multi-line code
> snippet, that I could do diff with, and make sure it gave a reasonable
> answer. (It also highlights the difference between difflib and
> patiencediff).

Do you mean that the patience sequence matcher doesn't put the 
(len(a), len(b), 0) at the end?  If it's going to be called
SequenceMatcher I'd prefer it return that too.

Perhaps it's just me but the use of pop is somewhat unobvious (I had to
stop and think about whether python popped from the left or right); it
might be better to write

               self.assertEquals(blocks[-1], (len(a), len(b), 0))
               self.assertEquals(matching, blocks[:-1])

Robert has said we should have assertions in the form of (expected,
actual), though the existing code is so inconsistent I won't pick on
this. :-)

> I suppose we could move the static text out of the function and into the
>  module scope, it may make it look a little cleaner. I don't really know
> what the 'best' thing here is.

By "static text" you mean the triple-quoted source?  I think it's OK
here.

However I don't think it's so good to have what looks like real Python
source used inline in test data; it can be too confusing when e.g.
reading a diff in mail without the benefit of syntax highlighting.  (I
might have done it myself in the past; I'm not picking on you.)  Can I
suggest instead that we

 - use non-python text (e.g. human language)
 - or, if the test relates to giving sensible diffs for changes to
   program source, deface it so that it's obviously not part of the
   program, e.g. by squashing to uppercase or rot-13.

> +        self.assertEquals([ '---  \n'
> +                          , '+++  \n'
> +                          , '@@ -4,6 +4,11 @@\n'
> +                          , ' d\n'
> +                          , ' e\n'
> +                          , ' f\n'
> +                          , '+x\n'
> +                          , '+y\n'
> +                          , '+d\n'
> +                          , '+e\n'
> +                          , '+f\n'
> +                          , ' g\n'
> +                          , ' h\n'
> +                          , ' i\n'
> +                          ]
> +                          , list(unified_diff(txt_a, txt_b,
> +                                 sequencematcher=SequenceMatcher)))

Maybe it'd be clearer to rename it to PatienceSequenceMatcher, or
say patiencediff.SequenceMatcher here, and similarly for difflib.
There are a few occurrences.

-- 
Martin