[RFC/MERGE] Avoid syncing on incorrect lines during diff

Tue Dec 19 14:23:21 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> The attached patch changes the PatienceSequenceMatcher algorithm
> slightly, so that it checks for unique lines ignoring whitespace. It
> still matches exact texts, but if two lines only differ by whitespace,
> it does not consider them synchronization points.

Cool.  I like the way it correctly matched the "self._need_to_create =
True" stuff.

This implementation is a lot more invasive than I was thinking of.  I
had imagined doing it as a wimpy wrapper:

def get_matchingblocks_no_ws(lines_a, lines_b):
    a_no_ws = [l.strip() for l in lines_a]
    b_no_ws = [l.strip() for l in lines_b
    matcher = PatienceSequenceMatcher(None, a_no_ws, b_no_ws])
    for a, b, n in matcher.get_matching_blocks():
       for range(n):
         if lines_a[a+n] == lines_b[b+n]:
             # Here, we yield matches line-by-line because we're too
             # lazy to collapse ranges.  Should be valid, though.
             yield a+n, b+n, 1

> In my (admittedly small) amount of testing, this again improves the
> human-readable quality of the generated diffs.
> 
> I'm also attaching 2 diffs. One generated by bzr.dev, and the other
> after the fixes. This is actually the diff that motivated me to write
> this patch, because I thought we could do better. Hopefully BB won't be
> confused

Sorry, BB didn't see [MERGE, so it didn't detect this one.
I wonder, should BB treat RFC/Merge as +0 by the author?

> 1) I haven't fully tested the performance impact of it. I tested that
> str.split() returns self if there is no whitespace removed, which seems
> nice, until I realized that every line is going to have at least a '\n'
> at the end of it.

Well, we could just strip leading whitespace.  But given recent events,
it probably makes sense to handle trailing whitespace too...

> 2) It hasn't been tested a lot with lots of different source texts.
> Though this would be a reason to put it into bzr, if only with a
> specific flag 'bzr diff --algorithm=no-whitespace'. It would be nice if
> we could say 'hey this looks bogus, would this other thing help?'.
> Though we also really need to do the opposite 'this looks reasonable, is
> this other one bogus?'.

Could also be helpful in merge.

> One possibility to help with (1) would be to use some sort of
> hash/checksum/etc for each line

Interesting thought.  Not sure how big a deal it is to keep spare copies
in memory, but this is probably solve it well.

> We also could use a bigger hash like md5/sha. But I worry about the
> overhead of doing that many big hashes, not to mention many lines are
> <20 characters, so the sha hash would actually end up using *more* memory.
> 
> I'd really like to get some feedback, though. Because I think something
> like this could be useful.

I should also point out that it would solve diff-across-indent really,
really well.

I don't have time ATM to give you a proper review.  Possibly this evening.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFh/XZ0F+nu1YWqI0RAkgSAKCFxFWa2hzmFicStmxcaO+QJTg73QCbBrvf
RMhoNwFj8osdgUvbTImArds=
=c1RX
-----END PGP SIGNATURE-----