One case where patience diff does much better

Sun Feb 26 21:41:09 GMT 2006

James Blackwell wrote:
> On Mon, Feb 27, 2006 at 05:37:36AM +1100, Martin Pool wrote:
>> On 26 Feb 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
>>
>>> Wow, resurrecting an *old* thread. :) I saw someone accessed the branch,
>>> and wondered why. :)
>> I think I was away or too busy when it originally happened.  There's a
>> bug open about diffutils getting tied up on pathological input, and I
>> was talking to Bram about it.
>>
>>> If it isn't on by default, I don't think it will be used. I think a
>>> reasonable test case would be to upgrade an existing bzr branch, and
>>> make sure all of the sha1 sums match before and after (and haven't changed).
>>>
>>> That will ensure that even if the algorithm is performing sub-optimally,
>>> it is at least performing *correctly*.
>> That makes sense.  Should we put it in now before 0.8 or hold it off?
>> The diff algorithm by itself seems fairly low risk.
> 
> I wasn't able to find the attachment on your email. Does the diff format
> with patience look similiar to the diff format for what users get today
> with "bzr diff"?
> 
> 
> Regards,
> James
> 

Short answer: yes

Longer answer:

Patience diff is just another way of generating a 'unified diff'. It is
a change to the difference engine. Not the difference viewer.

Long ramble about the difference:

difflib uses a 'longest common substring' match. So it compares two
texts, and finds the longest matching string, then ignores that section
to find the next longest, etc.

This gets hung up if you copy code from one function to another. Say you
first had:

def a_func():
  something
  somethingelse

Then you change that to:

def a_func():
  something
  somethingelse

def b_func():
  something_unique
  something
  somethingelse

If you have a lot of code below 'b_func' which wasn't changed, difflib
will tend to generate the above as:

 def a_func():
+ something
+ somethingelse
+
+def b_func():
+ something_unique
  something
  somethingelse

(Specifically, the 'something' and 'somethingelse' were shown as changed
in 'a' when really they were added by 'b'.

The Patience sorting matcher uses a different algorithm, which finds the
'longest common subset of unique lines'

So 'subset' means it doesn't have to be continuous, and unique lines
means it doesn't try to match 'something' or 'something else' in the
first round (it adds them in later.)

This tends to give better 'diff' performance, because unique matching
lines are much more likely to be the correct context, than lines which
have been duplicated.

You take a small performance penalty, though not major.

And I disagree with one of the implementation details by Bram.
Specifically, how we handle non-unique but matching lines. He felt that:
	XaaaY doesn't match any of QaaaP.

And he has a point that the matching is a little unsure, but Aaron & I
both felt that the 'aaa' should be matched.

(The unsure case is something like XaaaaY => QaaaP, which 'a' was
deleted? Or even worse XaaaaY => QaaPa, which one was deleted, was P
inserted, or were 2 lines deleted and one more 'a' was added.)

Because of the disagreement, I fall back to difflib (which will find
those matches), after using Patience in the first pass.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060226/a1cc7013/attachment.pgp