[MERGE] Patience diff

Thu May 25 15:29:30 BST 2006

I went ahead and fixed the problem with difflib and python2.4.3, while I
was there, I cleaned up some of the code.

Martin Pool wrote:

...

> 
> We might as well call s/cdv/patience/ consistently.

Done, a grep through the code produces no matches for 'cdv'.

> 

...

>>
>> Is this the best way to write it? Basically, I needed a multi-line code
>> snippet, that I could do diff with, and make sure it gave a reasonable
>> answer. (It also highlights the difference between difflib and
>> patiencediff).
> 
> Do you mean that the patience sequence matcher doesn't put the 
> (len(a), len(b), 0) at the end?  If it's going to be called
> SequenceMatcher I'd prefer it return that too.

The SequenceMatcher conforms to the interface and returns (len(a),
len(b), 0). I just didn't want to have to write that for every
'chk_blocks' test case. Which is why I pop it off to verify, and then
add the rest at the end.

The main difference between Patience and difflib, is that patience will
attach common portions of code to the previous chunk, rather than to
whichever chunk is longer. That is why the diffs look better, because
you never really copy code up to an existing function.

> 
> Perhaps it's just me but the use of pop is somewhat unobvious (I had to
> stop and think about whether python popped from the left or right); it
> might be better to write
> 
>                self.assertEquals(blocks[-1], (len(a), len(b), 0))
>                self.assertEquals(matching, blocks[:-1])
> 
> Robert has said we should have assertions in the form of (expected,
> actual), though the existing code is so inconsistent I won't pick on
> this. :-)

This code was written at UBZ, which was right around when we finally
decided this. All of my code from here on out uses (expected, actual).

If you want, we can fix this, but looking into it, it is a lot of lines
to fix.
The actual assertion is correct (matching is the expected value, blocks
is the measured value).
I changed the name to 'expected_blocks' to make it more obvious.

> 
>> I suppose we could move the static text out of the function and into the
>>  module scope, it may make it look a little cleaner. I don't really know
>> what the 'best' thing here is.
> 
> By "static text" you mean the triple-quoted source?  I think it's OK
> here.
> 
> However I don't think it's so good to have what looks like real Python
> source used inline in test data; it can be too confusing when e.g.
> reading a diff in mail without the benefit of syntax highlighting.  (I
> might have done it myself in the past; I'm not picking on you.)  Can I
> suggest instead that we
> 
>  - use non-python text (e.g. human language)
>  - or, if the test relates to giving sensible diffs for changes to
>    program source, deface it so that it's obviously not part of the
>    program, e.g. by squashing to uppercase or rot-13.

I just rot13'd the text. I probably could use real human text, since I
understand what is going on a little bit better. But then I would have
to find (or generate some text). And rot13 was much easier. :)

> 
>> +        self.assertEquals([ '---  \n'
>> +                          , '+++  \n'
>> +                          , '@@ -4,6 +4,11 @@\n'
>> +                          , ' d\n'
>> +                          , ' e\n'
>> +                          , ' f\n'
>> +                          , '+x\n'
>> +                          , '+y\n'
>> +                          , '+d\n'
>> +                          , '+e\n'
>> +                          , '+f\n'
>> +                          , ' g\n'
>> +                          , ' h\n'
>> +                          , ' i\n'
>> +                          ]
>> +                          , list(unified_diff(txt_a, txt_b,
>> +                                 sequencematcher=SequenceMatcher)))
> 
> Maybe it'd be clearer to rename it to PatienceSequenceMatcher, or
> say patiencediff.SequenceMatcher here, and similarly for difflib.
> There are a few occurrences.
> 

Done. I also cleaned up knit.py to call it 'KnitSequenceMatcher'.

The changes are in my jam-integration branch, attached is a patch for
review.

By the way, in real world cases, I'm starting to think that we shouldn't
match single common lines with no unique lines nearby. Because there a
lot of times when you have large chunks that change, and it just happens
that you have a ''' or if statement that matches in the middle. Which
ends up making the diff look worse, because it is trying to synchronize
on something it shouldn't.

Just a thought.

John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cleanup-patience.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20060525/2d194f50/attachment.diff 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060525/2d194f50/attachment.pgp