[RFC/MERGE] Avoid syncing on incorrect lines during diff

Tue Dec 19 14:57:26 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>> The attached patch changes the PatienceSequenceMatcher algorithm
>>> slightly, so that it checks for unique lines ignoring whitespace. It
>>> still matches exact texts, but if two lines only differ by whitespace,
>>> it does not consider them synchronization points.
> 
> Cool.  I like the way it correctly matched the "self._need_to_create =
> True" stuff.

Right, because it didn't get hung up on matching the trivial
'try:/else:' lines it was able to find real matches.

> 
> This implementation is a lot more invasive than I was thinking of.  I
> had imagined doing it as a wimpy wrapper:
> 
> def get_matchingblocks_no_ws(lines_a, lines_b):
>     a_no_ws = [l.strip() for l in lines_a]
>     b_no_ws = [l.strip() for l in lines_b
>     matcher = PatienceSequenceMatcher(None, a_no_ws, b_no_ws])
>     for a, b, n in matcher.get_matching_blocks():
>        for range(n):
>          if lines_a[a+n] == lines_b[b+n]:
>              # Here, we yield matches line-by-line because we're too
>              # lazy to collapse ranges.  Should be valid, though.
>              yield a+n, b+n, 1

This is actually a pretty different algorithm. This will tend to cause
things to line up when they only differ in whitespace. The final output
will still show them as different since you double check.

My patch works differently, in that it just tries to prevent the sync
points from being trivial lines.

Yours might actually have more value, I'm not sure yet. Because as you
point out, lines that change indent will now be sync points, rather than
being treated as just different texts.

> 
>>> In my (admittedly small) amount of testing, this again improves the
>>> human-readable quality of the generated diffs.
>>>
>>> I'm also attaching 2 diffs. One generated by bzr.dev, and the other
>>> after the fixes. This is actually the diff that motivated me to write
>>> this patch, because I thought we could do better. Hopefully BB won't be
>>> confused
> 
> Sorry, BB didn't see [MERGE, so it didn't detect this one.
> I wonder, should BB treat RFC/Merge as +0 by the author?

I think that fits our use case :)

> 
>>> 1) I haven't fully tested the performance impact of it. I tested that
>>> str.split() returns self if there is no whitespace removed, which seems
>>> nice, until I realized that every line is going to have at least a '\n'
>>> at the end of it.
> 
> Well, we could just strip leading whitespace.  But given recent events,
> it probably makes sense to handle trailing whitespace too...

My version would probably be served well by 'lstrip()', though it means
it would treat '  else: \n' as different from '  else:\n', which we
probably don't really want.

However, your proposal would also mean that when doing a diff, and
someone changes line endings or trailing whitespace, the line would be
more likely to stay in context. Because the first pass would actually
say that it had not changed. And it would only be the second pass that
says that it did. So yours is helped more by having trailing whitespace
also stripped.

Also, for source code, the point is pretty mute. I would guess that >
95% of all lines are indented to some degree, for most standard
languages. C/C++, python, java, etc. Even if it is only by convention,
all the real lines are indented. I suppose BrainF**k might stay left
aligned. But I've also seen that BF can be indented, it just doesn't
have the same conventions:
  http://en.wikipedia.org/wiki/Brainfuck
  http://www.iwriteiam.nl/Ha_BF.html
The way they usually look:
  http://www.hevanet.com/cristofd/brainfuck/dbfi.b
Or one with a bunch of comments:
  http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi1.b

Anyway, it would be reasonably to do your implementation. Another thing
we might consider, is if the size of the text gets too big, we could
break it into smaller chunks. That would sacrifice absolute correctness
for memory. Though we wouldn't save a lot of memory, since we still need
2 complete copies, but at least we wouldn't need 4.

> 
>>> 2) It hasn't been tested a lot with lots of different source texts.
>>> Though this would be a reason to put it into bzr, if only with a
>>> specific flag 'bzr diff --algorithm=no-whitespace'. It would be nice if
>>> we could say 'hey this looks bogus, would this other thing help?'.
>>> Though we also really need to do the opposite 'this looks reasonable, is
>>> this other one bogus?'.
> 
> Could also be helpful in merge.
> 

...

> I should also point out that it would solve diff-across-indent really,
> really well.
> 
> I don't have time ATM to give you a proper review.  Possibly this evening.
> 
> Aaron

For what you are proposing, I think we could easily do a
PatienceSequenceMatchWOWhitespace

Which simply is a wrapper around the existing one as you proposed. I
think it would solve the same problems that mine does, while also having
other benefits.

I would also guess that they have approximately the same performance
characteristics. Both of them need to re-evaluate every line that
matches to ensure that it actually does match.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFh/3WJdeBCYSNAAMRAs90AJ9Fs1RjCOoo0BA9mWK/mIpf2IzWdACdEI4B
00iWKVYtH9cUOrhGnUMnLGc=
=7Xj3
-----END PGP SIGNATURE-----