[RFC/MERGE] Avoid syncing on incorrect lines during diff

Martin Pool mbp at canonical.com
Tue Jan 9 08:34:09 GMT 2007


On 18 Dec 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> The attached patch changes the PatienceSequenceMatcher algorithm
> slightly, so that it checks for unique lines ignoring whitespace. It
> still matches exact texts, but if two lines only differ by whitespace,
> it does not consider them synchronization points.
> 
> In my (admittedly small) amount of testing, this again improves the
> human-readable quality of the generated diffs.
> 
> I'm also attaching 2 diffs. One generated by bzr.dev, and the other
> after the fixes. This is actually the diff that motivated me to write
> this patch, because I thought we could do better. Hopefully BB won't be
> confused, but I should be attaching them in the right order, and the
> extras are plain diffs rather than bundles (though this helps bundles as
> well).

So in those diffs, down at the bottom a function was entirely removed
and a new one inserted without (to a human) any real common content.
bzr.dev was matching on a line containing "else:", but your changes
avoid that and show just one big insertion and deletion, which I agree
is better.

In the section up the top (line 1123ff), it's also done a bit better:
one chunk of lines to do with handling NoSuchFile is identified as
being common rather than moved.

> I personally find the new diff easier to read, because it doesn't start
> and stop like the old one does.

On the whole I think I agree. 
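To make the rule concrete, here's a rough sketch of the idea (an
illustration only, not the patch's actual code; `sync_candidates` is a
made-up helper): a line only qualifies as a synchronization point if
its whitespace-stripped form is unique.

```python
from collections import Counter

def sync_candidates(lines):
    """Keep only lines whose whitespace-stripped form is unique
    (hypothetical helper, not the patch's API)."""
    stripped = [''.join(line.split()) for line in lines]
    counts = Counter(stripped)
    return [line for line, s in zip(lines, stripped) if counts[s] == 1]

lines = ["if x:\n", "    else:\n", "else:\n", "return 1\n"]
# "else:" occurs twice once whitespace is ignored, so neither copy can
# anchor a match; the genuinely unique lines remain.
assert sync_candidates(lines) == ["if x:\n", "return 1\n"]
```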

> 
> One possibility to help with (1) would be to use some sort of
> hash/checksum/etc for each line, rather than using the actual line
> contents. It would mean that we would get a few more lines that aren't
> considered unique in the first pass. And it would be a random thing,
> which could be weird. But likely there will be a nearby line which
> *does* match, and then the cleanup part of the algorithm will keep that
> from being exposed. So I would guess that failure mode would be very
> unlikely. Also, if we use 'hash(str)' we get python's hash of the
> string, which is a nice simple integer, which should generally avoid
> collisions and should be fast to compute.
> 
> The algorithm is a little bit funny, as it starts with the length of the
> string, and then mixes in the string bits. Try:
> 
>   hash('') == 0
>   hash('\x00') == 1
>   hash('\x00\x00\x00\x00') == 4
>   hash('\x00\x00\x00\x01') == 5
>   hash('\x00\x00\x01\x00') == 1000007 #???
> also
>   hash('\x00\x00\x00\x04') == 0 # it is obviously doing an XOR here
> 
> We also could use a bigger hash like md5/sha. But I worry about the
> overhead of doing that many big hashes, not to mention many lines are
> <20 characters, so the sha hash would actually end up using *more* memory.
> 
> I'd really like to get some feedback, though. Because I think something
> like this could be useful.
> 
> John
> =:->
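For the record, the hashing idea amounts to something like this (an
illustration only; `hash_lines` is a hypothetical helper, and the
concrete integer values vary between Python builds, so nothing should
depend on them):

```python
def hash_lines(lines):
    """Map each line to its integer hash (hypothetical helper), so the
    matcher compares small integers rather than full line texts."""
    return [hash(line) for line in lines]

a = ["def foo():\n", "    return 1\n"]
b = ["def foo():\n", "    return 2\n"]
ha, hb = hash_lines(a), hash_lines(b)
# Equal lines always hash equal within a single run...
assert ha[0] == hb[0]
# ...and distinct lines differ, barring an (unlikely) collision, which
# is the failure mode discussed above.
assert ha[1] != hb[1]
```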

> # Bazaar revision bundle v0.8
> #
> # message:
> #   Improve the patience matching rules, to avoid artificial sync'ing during diff.
> # committer: John Arbash Meinel <john at arbash-meinel.com>
> # date: Mon 2006-12-18 16:44:19.892999887 -0600

> 2) It hasn't been tested a lot with lots of different source texts.
> Though this would be a reason to put it into bzr, if only with a
> specific flag 'bzr diff --algorithm=no-whitespace'. It would be nice if
> we could say 'hey this looks bogus, would this other thing help?'.
> Though we also really need to do the opposite 'this looks reasonable, is
> this other one bogus?'.
> 

That would be good; even if it's not the default, it would let us try
it out.

> 
> === modified file NEWS
> --- NEWS
> +++ NEWS
> @@ -2,6 +2,11 @@
>  
>    IMPROVEMENTS:
>  
> +    * Patience diff is now generates even nicer diffs, by considering
> +      lines that only differ in whitespace to not be unique. This helps to
> +      prevent incorrect syncronizing when changing indentation, etc.
> +      (John Arbash Meinel)

'synchronizing'



> +
>      * New connection: ``bzr+http://`` which supports tunnelling the smart
>        protocol over an HTTP connection. If writing is enabled on the bzr
>        server, then you can write over the http connection.
> 
> === modified file bzrlib/patiencediff.py
> --- bzrlib/patiencediff.py
> +++ bzrlib/patiencediff.py
> @@ -28,11 +28,13 @@
>  __all__ = ['PatienceSequenceMatcher', 'unified_diff', 'unified_diff_files']
>  
>  
> -def unique_lcs(a, b):
> +def unique_lcs(a, b, a_no_whitespace=None, b_no_whitespace=None):
>      """Find the longest common subset for unique lines.
>  
>      :param a: An indexable object (such as string or list of strings)
>      :param b: Another indexable object (such as string or list of strings)
> +    :param a_no_whitespace: Lines of a with whitespace removed
> +    :param b_no_whitespace: Lines of b with whitespace removed
>      :return: A list of tuples, one for each line which is matched.
>              [(line_in_a, line_in_b), ...]

s//leading and trailing whitespace//

I wonder - is it really necessary to hardcode that this is about removal
of whitespace?  Would it perhaps be cleaner to take a function that
conditions the lines on both sides...

Would it work to have a dictionary mapping from the line to the stripped
(or conditioned) version of that line?  Then you could avoid doing the
stripping twice for line texts repeated either in both files or within a
file, which we expect to be reasonably common...  Anyhow that can wait
for it to be profiled.
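Something like this is what I have in mind (a rough sketch, not
profiled; `condition_lines`, `strip_whitespace`, and the shared cache
dict are all hypothetical names, not the patch's API):

```python
def strip_whitespace(line):
    """Example conditioner: drop all whitespace from a line."""
    return ''.join(line.split())

def condition_lines(lines, condition, cache):
    """Return conditioned copies of lines, computing each distinct
    line's conditioned form only once via the shared cache."""
    out = []
    for line in lines:
        if line not in cache:
            cache[line] = condition(line)
        out.append(cache[line])
    return out

cache = {}
a = condition_lines(["if x:\n", "  if x:\n", "else:\n"],
                    strip_whitespace, cache)
b = condition_lines(["    if x:\n", "    else:\n"],
                    strip_whitespace, cache)
# Lines differing only in whitespace condition to the same string, and
# repeated lines (within or across files) hit the cache.
assert a[0] == a[1] == b[0] == "ifx:"
assert a[2] == b[1] == "else:"
```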

-- 
Martin


