[RFC/MERGE] Avoid syncing on incorrect lines during diff

Mon Dec 18 23:05:48 GMT 2006

The attached patch changes the PatienceSequenceMatcher algorithm
slightly, so that it checks for unique lines ignoring whitespace. It
still matches exact texts, but if two lines only differ by whitespace,
it does not consider them synchronization points.

In my (admittedly small) amount of testing, this again improves the
human-readable quality of the generated diffs.

I'm also attaching 2 diffs. One generated by bzr.dev, and the other
after the fixes. This is actually the diff that motivated me to write
this patch, because I thought we could do better. Hopefully BB won't be
confused, but I should be attaching them in the right order, and the
extras are plain diffs rather than bundles (though this helps bundles as
well).

I did trim out the parts that are identical in the 2 diffs, so that it
should be apparent what this changes.

I personally find the new diff easier to read, because it doesn't start
and stop like the old one does.

There are 2 reasons why this is currently RFC.

1) I haven't fully tested the performance impact of it. I tested that
str.split() returns self if there is no whitespace removed, which seems
nice, until I realized that every line is going to have at least a '\n'
at the end of it. So at the very least this means we end up holding 4
copies of the contents of a file in memory during diff. 1 for original,
1 for original without whitespace, 1 for new, and 1 more for new without
whitespace.
Also, there is a performance impact of iterating over all lines and
calling strip() on each one. It probably isn't huge versus the time
spent doing the comparisons. It also causes 1 more comparison in the
inner loop, since you are now comparing both the no-whitespace text, and
the with-whitespace text.

2) It hasn't been tested a lot with lots of different source texts.
Though this would be a reason to put it into bzr, if only with a
specific flag 'bzr diff --algorithm=no-whitespace'. It would be nice if
we could say 'hey this looks bogus, would this other thing help?'.
Though we also really need to do the opposite 'this looks reasonable, is
this other one bogus?'.

One possibility to help with (1) would be to use some sort of
hash/checksum/etc for each line, rather than using the actual line
contents. It would mean that we would get a few more lines that aren't
considered unique in the first pass. And it would be a random thing,
which could be weird. But likely there will be a nearby line which
*does* match, and then the cleanup part of the algorithm will keep that
from being exposed. So I would guess that failure mode would be very
unlikely. Also, if we use 'hash(str)' we get python's hash of the
string, which is a nice simple integer, which should generally avoid
collisions and should be fast to compute.

The algorithm is a little bit funny, as it starts with the length of the
string, and then mixes in the string bits. Try:

  hash('') == 0
  hash('\x00') == 1
  hash('\x00\x00\x00\x00') == 4
  hash('\x00\x00\x00\x01') == 5
  hash('\x00\x00\x01\x00') == 1000007 #???
also
  hash('\x00\x00\x00\x04') == 0 # it is obviously doing an XOR here

We also could use a bigger hash like md5/sha. But I worry about the
overhead of doing that many big hashes, not to mention many lines are
<20 characters, so the sha hash would actually end up using *more* memory.

I'd really like to get some feedback, though. Because I think something
like this could be useful.

John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patience_no_whitespace.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20061218/42f18d6d/attachment.diff 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bzr_dev.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20061218/42f18d6d/attachment-0001.diff 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: no_whitespace.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20061218/42f18d6d/attachment-0002.diff