[MERGE] Patience diff

Fri May 26 02:59:09 BST 2006

On 25 May 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:

> >> Is this the best way to write it? Basically, I needed a multi-line code
> >> snippet, that I could do diff with, and make sure it gave a reasonable
> >> answer. (It also highlights the difference between difflib and
> >> patiencediff).
> > 
> > Do you mean that the patience sequence matcher doesn't put the 
> > (len(a), len(b), 0) at the end?  If it's going to be called
> > SequenceMatcher I'd prefer it return that too.
> 
> The SequenceMatcher conforms to the interface and returns (len(a),
> len(b), 0). I just didn't want to have to write that for every
> 'chk_blocks' test case. Which is why I pop it off to verify, and then
> add the rest at the end.
> 
> The main difference between Patience and difflib, is that patience will
> attach common portions of code to the previous chunk, rather than to
> whichever chunk is longer. That is why the diffs look better, because
> you never really copy code up to an existing function.

Oh I see - I just didn't understand your mail.

> > Perhaps it's just me but the use of pop is somewhat unobvious (I had to
> > stop and think about whether python popped from the left or right); it
> > might be better to write
> > 
> >                self.assertEquals(blocks[-1], (len(a), len(b), 0))
> >                self.assertEquals(matching, blocks[:-1])
> > 
> > Robert has said we should have assertions in the form of (expected,
> > actual), though the existing code is so inconsistent I won't pick on
> > this. :-)
> 
> This code was written at UBZ, which was right around when we finally
> decided this. All of my code from here on out uses (expected, actual).
> 
> If you want, we can fix this, but looking into it, it is a lot of
> lines to fix.  The actual assertion is correct (matching is the
> expected value, blocks is the measured value).  I changed the name to
> 'expected_blocks' to make it more obvious.

I don't think we need to go through and fix all of them; I just thought
I'd mention it.  For many of the assertions of equality or inequality
the ordering of parameters doesn't matter but for things like
assertContainsRe it does.

> Done. I also cleaned up knit.py to call it 'KnitSequenceMatcher'.
> 
> The changes are in my jam-integration branch, attached is a patch for
> review.
> 
> By the way, in real world cases, I'm starting to think that we shouldn't
> match single common lines with no unique lines nearby. Because there a
> lot of times when you have large chunks that change, and it just happens
> that you have a ''' or if statement that matches in the middle. Which
> ends up making the diff look worse, because it is trying to synchronize
> on something it shouldn't.

Yes, maybe we should.  For humans having a strictly minimal diff
is suboptimal; showing those common lines as added and deleted within
larger chunks can be better.  

> === modified file 'bzrlib/merge3.py'
> --- bzrlib/merge3.py	
> +++ bzrlib/merge3.py	
> @@ -20,8 +20,11 @@
>  
>  
>  from bzrlib.errors import CantReprocessAndShowBase
> -from bzrlib.patiencediff import SequenceMatcher
> +from bzrlib.patiencediff import PatienceSequenceMatcher
>  from bzrlib.textfile import check_text_lines
> +
> +SequenceMatcher = PatienceSequenceMatcher
> +
>  
>  def intersect(ra, rb):
>      """Given two ranges return the range where they intersect or None.
> 

Point for discussion: when we have a name that's used as a callable
factory, how should it be named?  One way is to name it like a class, as
you have here, which makes it clear that you're meant to treat it like a
constructor.   On the other hand it may be a bit surprising that it
doesn't generate instances of SequenceMatcher.  It might be better to
call it, say, sequencematcher_class, or sequencematcher_factory.  (This
is a bit theoretical; don't let it block your merge but I wondered what
people thought.)

-- 
Martin