[MERGE] Patience diff
John A Meinel
john at arbash-meinel.com
Tue May 23 13:59:44 BST 2006
Aaron Bentley wrote:
> John Arbash Meinel wrote:
> |>| is different, I would tend to change it to:
> |>|
> |>| new_file = file(file_path, 'rb')
> |>| assert list(new_file) == new_lines
> |>
> |>Makes sense.
>
> Done.
>
> |>| We should go through and make sure that our copyright statements have
> |>| 2006, and use # all the way through. (bzrlib/cdv/__init__.py doesn't
> |>| have the right lines in the beginning)
> |>
> |>It's also attributed to Canonical, and I think it should be attributed
> |>to Bram Cohen, at least in part.
> |
> |
> | Well __init__.py is my addition, and can be attributed to Canonical. But
> | nofrillsprecisemerge.py should be attributed to Bram.
>
> Okay, the new files are attributed correctly. I've merged nofrills with
> yours, and done a shared attribution.
>
> |>| We should also check that we have the latest 'nofrillsprecisemerge.py'
> |>| files. I tried hard to make sure I was using an unmodified version so
> |>| that any updates could be easily obtained.
>
> I checked the latest on revctrl.org. It seems our version is the
> newest. Actually, someone added a bunch of epydoc-style comments to our
> version, so it's not a 1-1 match anymore.
Actually, that would probably be me. :) I went through and tried to
document what the different parameters were, just so I could follow what
was going on.
>
> |>Perhaps we should pull the nofrills stuff into cdvdifflib, and rename it
> |>to bzrlib/patiencediff.py
> |
> |
> | I would be okay with it going either way.
>
> Okay, I've done this.
>
> |>| In merge3.py we have a commented out import. I think we prefer to just
> |>| remove them now.
> |>
>
> Fixed.
>
> |>| We also used to use a junk expression so that lines that only contained
> |>| space,tab or # would be ignored. It probably isn't needed anymore, but
> |>| we should be aware that the junk_re should be deleted.
>
> I've deleted it.
>
> |>| test_diff.py has a bunch of pep8 whitespace issues.
>
> I've done my best, but if there's anything left, let me know.
>
> | Well, ultimately I think this is just a tuned version of the original
> | SequenceMatcher. I see that Robert was the one making the tuning, so we
> | should ask him. But it doesn't benefit us much to include Patience diff
> | if our primary storage mechanism isn't going to use it.
> |
> | I wasn't part of the tuning, so I don't know what it involved, or how
> | much work it would be for him to tune Patience diff.
>
> Well, merge and diff will benefit, but it would be nice to have annotate
> and knit merge benefit, too.
>
> How do you like me now?
>
> Aaron
------------------------------------------------------------------------
=== added file 'bzrlib/patiencediff.py'
--- /dev/null
+++ bzrlib/patiencediff.py
@@ -0,0 +1,392 @@
+#!/usr/bin/env python
+# Copyright (C) 2005 Bram Cohen, Copyright (C) 2005, 2006 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+
+from bisect import bisect
+from copy import copy
+import difflib
+import time
+import os
+import sys
PEP8 here
+
+__all__ = ['SequenceMatcher', 'unified_diff', 'unified_diff_files']
+
+def unique_lcs(a, b):
+ """Find the longest common subset for unique lines.
+
+ :param a: An indexable object (such as string or list of strings)
+ :param b: Another indexable object (such as string or list of strings)
+ :return: A list of tuples, one for each line which is matched.
+ [(line_in_a, line_in_b), ...]
+
+ This only matches lines which are unique on both sides.
+ This helps prevent common lines from over influencing match
+ results.
+ The longest common subset uses the Patience Sorting algorithm:
+ http://en.wikipedia.org/wiki/Patience_sorting
+ """
+ # set index[line in a] = position of line in a,
+ # unless the line is a duplicate, in which case it's set to None
+ index = {}
+ for i in xrange(len(a)):
+ line = a[i]
+ if line in index:
+ index[line] = None
+ else:
+ index[line]= i
+ # make btoa[i] = position of line i in a, unless
+ # that line doesn't occur exactly once in both,
+ # in which case it's set to None
+ btoa = [None] * len(b)
+ index2 = {}
+ for pos, line in enumerate(b):
+ next = index.get(line)
+ if next is not None:
+ if line in index2:
+ # unset the previous mapping, which we now know to
+ # be invalid because the line isn't unique
+ btoa[index2[line]] = None
+ del index[line]
+ else:
+ index2[line] = pos
+ btoa[pos] = next
+ # this is the Patience sorting algorithm
+ # see http://en.wikipedia.org/wiki/Patience_sorting
+ backpointers = [None] * len(b)
+ stacks = []
+ lasts = []
+ k = 0
+ for bpos, apos in enumerate(btoa):
+ if apos is None:
+ continue
+ # as an optimization, check if the next line comes at the end,
+ # because it usually does
+ if stacks and stacks[-1] < apos:
+ k = len(stacks)
+ # as an optimization, check if the next line comes right after
+ # the previous line, because usually it does
+ elif stacks and stacks[k] < apos and (k == len(stacks) - 1 or stacks[k+1] > apos):
+ k += 1
+ else:
+ k = bisect(stacks, apos)
+ if k > 0:
+ backpointers[bpos] = lasts[k-1]
+ if k < len(stacks):
+ stacks[k] = apos
+ lasts[k] = bpos
+ else:
+ stacks.append(apos)
+ lasts.append(bpos)
+ if len(lasts) == 0:
+ return []
+ result = []
+ k = lasts[-1]
+ while k is not None:
+ result.append((btoa[k], k))
+ k = backpointers[k]
+ result.reverse()
+ return result
PEP8. Also, Bram seemed to like running self tests every time the module
was imported, which is what all these asserts are for.
They should be fast, so I'm okay with keeping them, but I did add them
to selftest, so it would be safe to get rid of them.
+
+assert unique_lcs('', '') == []
+assert unique_lcs('a', 'a') == [(0, 0)]
+assert unique_lcs('a', 'b') == []
+assert unique_lcs('ab', 'ab') == [(0, 0), (1, 1)]
+assert unique_lcs('abcde', 'cdeab') == [(2, 0), (3, 1), (4, 2)]
+assert unique_lcs('cdeab', 'abcde') == [(0, 2), (1, 3), (2, 4)]
+assert unique_lcs('abXde', 'abYde') == [(0, 0), (1, 1), (3, 3), (4, 4)]
+assert unique_lcs('acbac', 'abc') == [(2, 1)]
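(An aside for anyone reading along: the core of unique_lcs is just patience
sorting used to find a longest increasing subsequence over the btoa
positions. Here is a stripped-down sketch of that idea on plain integers;
illustrative only, with made-up names, not the code above:

    from bisect import bisect_left

    def longest_increasing_subsequence(seq):
        # tops[k] is the value on top of pile k, piles[k] is its index in
        # seq, and prev[i] remembers the top of the pile to the left when
        # seq[i] was placed, so the subsequence can be walked backwards.
        tops, piles, prev = [], [], []
        for i, x in enumerate(seq):
            k = bisect_left(tops, x)
            if k > 0:
                prev.append(piles[k-1])
            else:
                prev.append(None)
            if k == len(tops):
                tops.append(x)
                piles.append(i)
            else:
                tops[k] = x
                piles[k] = i
        result = []
        if piles:
            i = piles[-1]
            while i is not None:
                result.append(seq[i])
                i = prev[i]
            result.reverse()
        return result

    assert longest_increasing_subsequence([2, 0, 3, 1, 4]) == [0, 1, 4]

unique_lcs does the same dance over the apos values, carrying (apos, bpos)
pairs instead of raw values.)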
+
+def recurse_matches(a, b, ahi, bhi, answer, maxrecursion):
+ """Find all of the matching text in the lines of a and b.
+
+ :param a: A sequence
+ :param b: Another sequence
+ :param ahi: The maximum length of a to check, typically len(a)
+ :param bhi: The maximum length of b to check, typically len(b)
+ :param answer: The return array. Will be filled with tuples
+ indicating [(line_in_a, line_in_b), ...]
+ :param maxrecursion: The maximum depth to recurse.
+ Must be a positive integer.
+ :return: None, the return value is in the parameter answer, which
+ should be a list
+
+ """
+ if maxrecursion < 0:
+ # this will never happen normally, this check is to prevent DOS attacks
+ return
+ oldlength = len(answer)
+ if len(answer) == 0:
+ alo, blo = 0, 0
+ else:
+ alo, blo = answer[-1]
+ alo += 1
+ blo += 1
+ if alo == ahi or blo == bhi:
+ return
+ for apos, bpos in unique_lcs(a[alo:ahi], b[blo:bhi]):
+ # recurse between lines which are unique in each file and match
+ apos += alo
+ bpos += blo
+ recurse_matches(a, b, apos, bpos, answer, maxrecursion - 1)
+ answer.append((apos, bpos))
+ if len(answer) > oldlength:
+ # find matches between the last match and the end
+ recurse_matches(a, b, ahi, bhi, answer, maxrecursion - 1)
+ elif a[alo] == b[blo]:
+ # find matching lines at the very beginning
+ while alo < ahi and blo < bhi and a[alo] == b[blo]:
+ answer.append((alo, blo))
+ alo += 1
+ blo += 1
+ recurse_matches(a, b, ahi, bhi, answer, maxrecursion - 1)
+ elif a[ahi - 1] == b[bhi - 1]:
+ # find matching lines at the very end
+ nahi = ahi - 1
+ nbhi = bhi - 1
+ while nahi > alo and nbhi > blo and a[nahi - 1] == b[nbhi - 1]:
+ nahi -= 1
+ nbhi -= 1
+ recurse_matches(a, b, nahi, nbhi, answer, maxrecursion - 1)
+ for i in xrange(ahi - nahi):
+ answer.append((nahi + i, nbhi + i))
+
Same thing here (this is selftest code)
+a1 = []
+recurse_matches(['a', None, 'b', None, 'c'], ['a', 'a', 'b', 'c', 'c'], 5, 5, a1, 10)
+assert a1 == [(0, 0), (2, 2), (4, 4)]
+a2 = []
+recurse_matches(['a', 'c', 'b', 'a', 'c'], ['a', 'b', 'c'], 5, 3, a2, 10)
+assert a2 == [(0, 0), (2, 1), (4, 2)]
+
+a3 = []
+recurse_matches(['a', 'B', 'c', 'c', 'D', 'e'], ['a', 'b', 'c', 'c', 'd', 'e'], 6, 6, a3, 10)
+# FIXME: recurse_matches won't match non-unique lines, surrounded by bogus text
+# This is what it should be
+#assert a3 == [(0,0), (2,2), (3,3), (5,5)]
+# This is what it currently gives:
+assert a3 == [(0,0), (5,5)]
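(To connect the dots: this limitation is exactly what the _check_with_diff
workaround in the SequenceMatcher class below compensates for. Going by the
tests later in this patch, the full matcher does recover the 'cc' run that
recurse_matches misses here:

    s = SequenceMatcher(None, 'aBccDe', 'abccde')
    s.get_matching_blocks()
    # => [(0, 0, 1), (2, 2, 2), (5, 5, 1), (6, 6, 0)]
    # the trailing (6, 6, 0) is the sentinel entry difflib always appends
)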
+
+
+class SequenceMatcher(difflib.SequenceMatcher):
+ """Compare a pair of sequences using longest common subset."""
+
+ def __init__(self, isjunk=None, a='', b=''):
+ if isjunk is not None:
+ raise NotImplementedError('Currently we do not support'
+ ' isjunk for sequence matching')
+ difflib.SequenceMatcher.__init__(self, isjunk, a, b)
+
+ def _check_with_diff(self, alo, ahi, blo, bhi, answer):
+ """Use the original diff algorithm on an unmatched section.
+
+ This will check to make sure the range is worth checking,
+ before doing any work.
+
+ :param alo: The last line that actually matched
+ :param ahi: The next line that actually matches
+ :param blo: Same as alo, only for the 'b' set
+ :param bhi: Same as ahi
+ :param answer: An array which will have the new ranges appended to it
+ :return: None
+ """
+ # WORKAROUND
+ # recurse_matches has an implementation design
+ # which does not match non-unique lines
+ # if they do not touch matching unique lines,
+ # so we rerun the regular diff algorithm
+ # if we find a large enough chunk.
+
+ # recurse_matches already looked at the direct
+ # neighbors, so we only need to run if there is
+ # enough space to do so
+ if ahi - alo > 2 and bhi - blo > 2:
+ a = self.a[alo+1:ahi-1]
+ b = self.b[blo+1:bhi-1]
+ m = difflib.SequenceMatcher(None, a, b)
+ new_blocks = m.get_matching_blocks()
+ # difflib always adds a final match
+ new_blocks.pop()
+ for blk in new_blocks:
+ answer.append((blk[0]+alo+1,
+ blk[1]+blo+1,
+ blk[2]))
+
+ def __helper(self, alo, ahi, blo, bhi, answer):
+ matches = []
+ a = self.a[alo:ahi]
+ b = self.b[blo:bhi]
+ recurse_matches(a, b, len(a), len(b), matches, 10)
+ # Matches now has individual line pairs of
+ # line A matches line B, at the given offsets
+
+ start_a = start_b = None
+ length = 0
+ for i_a, i_b in matches:
+ if (start_a is not None
+ and (i_a == start_a + length)
+ and (i_b == start_b + length)):
+ length += 1
+ else:
+ # New block
+ if start_a is None:
+ # We need to check from 0,0 until the current match
+ self._check_with_diff(alo-1, i_a+alo, blo-1, i_b+blo, answer)
+ else:
+ answer.append((start_a+alo, start_b+blo, length))
+ self._check_with_diff(start_a+alo+length, i_a+alo,
+ start_b+blo+length, i_b+blo,
+ answer)
+
+ start_a = i_a
+ start_b = i_b
+ length = 1
+
+ if length != 0:
+ answer.append((start_a+alo, start_b+blo, length))
+ self._check_with_diff(start_a+alo+length, ahi+1,
+ start_b+blo+length, bhi+1,
+ answer)
+ if not matches:
+ # Nothing matched, so we need to send the complete text
+ self._check_with_diff(alo-1, ahi+1, blo-1, bhi+1, answer)
+
+ # For consistency's sake, make sure all matches are only increasing
+ if __debug__:
+ next_a = -1
+ next_b = -1
+ for a,b,match_len in answer:
+ assert a >= next_a, 'Non-increasing matches for a'
+ assert b >= next_b, 'Non-increasing matches for b'
+ next_a = a + match_len
+ next_b = b + match_len
+
PEP8
+# This is a version of unified_diff which only adds a factory parameter
+# so that you can override the default SequenceMatcher
+# this has been submitted as a patch to python
+
+def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
+ tofiledate='', n=3, lineterm='\n',
+ sequencematcher=None):
+ r"""
+ Compare two sequences of lines; generate the delta as a unified diff.
+
+ Unified diffs are a compact way of showing line changes and a few
+ lines of context. The number of context lines is set by 'n' which
+ defaults to three.
+
+ By default, the diff control lines (those with ---, +++, or @@) are
+ created with a trailing newline. This is helpful so that inputs
+ created from file.readlines() result in diffs that are suitable for
+ file.writelines() since both the inputs and outputs have trailing
+ newlines.
+
+ For inputs that do not have trailing newlines, set the lineterm
+ argument to "" so that the output will be uniformly newline free.
+
+ The unidiff format normally has a header for filenames and modification
+ times. Any or all of these may be specified using strings for
+ 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. The modification
+ times are normally expressed in the format returned by time.ctime().
+
+ Example:
+
+ >>> for line in unified_diff('one two three four'.split(),
+ ... 'zero one tree four'.split(), 'Original', 'Current',
+ ... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003',
+ ... lineterm=''):
+ ... print line
+ --- Original Sat Jan 26 23:30:50 1991
+ +++ Current Fri Jun 06 10:20:52 2003
+ @@ -1,4 +1,4 @@
+ +zero
+ one
+ -two
+ -three
+ +tree
+ four
+ """
+ if sequencematcher is None:
+ sequencematcher = difflib.SequenceMatcher
+
+ started = False
+ for group in sequencematcher(None,a,b).get_grouped_opcodes(n):
+ if not started:
+ yield '--- %s %s%s' % (fromfile, fromfiledate, lineterm)
+ yield '+++ %s %s%s' % (tofile, tofiledate, lineterm)
+ started = True
+ i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
+ yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
+ for tag, i1, i2, j1, j2 in group:
+ if tag == 'equal':
+ for line in a[i1:i2]:
+ yield ' ' + line
+ continue
+ if tag == 'replace' or tag == 'delete':
+ for line in a[i1:i2]:
+ yield '-' + line
+ if tag == 'replace' or tag == 'insert':
+ for line in b[j1:j2]:
+ yield '+' + line
PEP8
+
+def unified_diff_files(a, b, sequencematcher=None):
+ """Generate the diff for two files.
+ """
+ # Should this actually be an error?
+ if a == b:
+ return []
+ if a == '-':
+ file_a = sys.stdin
+ time_a = time.time()
+ else:
+ file_a = open(a, 'rb')
+ time_a = os.stat(a).st_mtime
+
+ if b == '-':
+ file_b = sys.stdin
+ time_b = time.time()
+ else:
+ file_b = open(b, 'rb')
+ time_b = os.stat(b).st_mtime
+
+ # TODO: Include fromfiledate and tofiledate
+ return unified_diff(file_a.readlines(), file_b.readlines(),
+ fromfile=a, tofile=b,
+ sequencematcher=sequencematcher)
+
+def main(args):
+ import optparse
+ p = optparse.OptionParser(usage='%prog [options] file_a file_b'
+ '\nFiles can be "-" to read from stdin')
+ p.add_option('--cdv', dest='matcher', action='store_const', const='cdv',
+ default='cdv', help='Use the cdv difference algorithm')
+ p.add_option('--difflib', dest='matcher', action='store_const', const='difflib',
+ default='cdv', help='Use python\'s difflib algorithm')
+
+ algorithms = {'cdv':SequenceMatcher, 'difflib':difflib.SequenceMatcher}
+
+ (opts, args) = p.parse_args(args)
+ matcher = algorithms[opts.matcher]
+
+ if len(args) != 2:
+ print 'You must supply 2 filenames to diff'
+ return -1
+
+ for line in unified_diff_files(args[0], args[1], sequencematcher=matcher):
+ sys.stdout.write(line)
+
+if __name__ == '__main__':
+ sys.exit(main(sys.argv[1:]))
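(If you want to poke at this standalone, main() makes it drivable from
Python as well as from the shell; the file names below are made up, and per
the usage string '-' reads from stdin:

    from bzrlib.patiencediff import main
    main(['old.txt', 'new.txt'])                # patience diff, the default
    main(['--difflib', 'old.txt', 'new.txt'])   # plain difflib, for comparison
)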
=== added file 'patience-test.py'
--- /dev/null
+++ patience-test.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python2.4
+# Copyright (C) 2006 by Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
+import difflib
+from StringIO import StringIO
+from subprocess import Popen, PIPE
+from tempfile import mkdtemp
+
+from bzrlib.branch import Branch
+from bzrlib.patiencediff import SequenceMatcher
+from bzrlib.diff import internal_diff
+from bzrlib.osutils import pathjoin
+
+
+def patch(path, patch_lines):
+ """Apply a patch to a branch, using patch(1). URLs may be used."""
+ cmd = ['patch', '--quiet', path]
+ r = 0
+ child_proc = Popen(cmd, stdin=PIPE)
+ for line in patch_lines:
+ child_proc.stdin.write(line)
+ child_proc.stdin.close()
+ r = child_proc.wait()
+ return r
+
+
+old_total = 0
+new_total = 0
+b = Branch.open_containing('.')[0]
+repo = b.repository
+repo.lock_write()
+try:
+ temp_dir = mkdtemp()
+ transaction = repo.get_transaction()
+ file_list = list(repo.text_store)
+ for i, file_id in enumerate(file_list):
+ print "%.2f%% %d of %d %s" % ((float(i)/len(file_list) * 100), i,
+ len(file_list), file_id)
+ versioned_file = repo.text_store.get_weave(file_id, transaction)
+ last_id = None
+ for revision_id in versioned_file.versions():
+ if last_id != None:
+ old_lines = versioned_file.get_lines(last_id)
+ new_lines = versioned_file.get_lines(revision_id)
+ if ''.join(old_lines) == ''.join(new_lines):
+ continue
+
+ new_patch = StringIO()
+ try:
+ internal_diff('old', old_lines, 'new', new_lines, new_patch,
+ sequence_matcher=SequenceMatcher)
+ except:
+ file(pathjoin(temp_dir, 'old'),
+ 'wb').write(''.join(old_lines))
+ file(pathjoin(temp_dir, 'new'),
+ 'wb').write(''.join(new_lines))
+ print "temp dir is %s" % temp_dir
+ raise
+ old_patch = StringIO()
+ internal_diff('old', old_lines, 'new', new_lines, old_patch,
+ sequence_matcher=difflib.SequenceMatcher)
+ old_total += len(old_patch.getvalue())
+ new_total += len(new_patch.getvalue())
+ new_patch.seek(0)
+ file_path = pathjoin(temp_dir, 'file')
+ orig_file = file(file_path, 'wb')
+ for line in old_lines:
+ orig_file.write(line)
+ orig_file.close()
+ patch(file_path, new_patch)
+ new_file = file(file_path, 'rb')
+ assert list(new_file) == new_lines
+ last_id = revision_id
+ print old_total, new_total
+finally:
+ repo.unlock()
Did you re-run the tests with the new patience-test.py? I'm guessing it
still worked, but just thought I would check.
=== modified file 'bzrlib/diff.py'
--- bzrlib/diff.py
+++ bzrlib/diff.py
@@ -17,6 +17,7 @@
from bzrlib.delta import compare_trees
from bzrlib.errors import BzrError
import bzrlib.errors as errors
+from bzrlib.patiencediff import SequenceMatcher, unified_diff
from bzrlib.symbol_versioning import *
from bzrlib.textfile import check_text_lines
from bzrlib.trace import mutter
@@ -26,9 +27,7 @@
# list, write them out directly, etc etc.
def internal_diff(old_filename, oldlines, new_filename, newlines, to_file,
- allow_binary=False):
- import difflib
-
+ allow_binary=False, sequence_matcher=None):
# FIXME: difflib is wrong if there is no trailing newline.
# The syntax used by patch seems to be "\ No newline at
# end of file" following the last diff line from that
@@ -49,9 +48,12 @@
check_text_lines(oldlines)
check_text_lines(newlines)
- ud = difflib.unified_diff(oldlines, newlines,
- fromfile=old_filename+'\t',
- tofile=new_filename+'\t')
+ if sequence_matcher is None:
+ sequence_matcher = SequenceMatcher
+ ud = unified_diff(oldlines, newlines,
+ fromfile=old_filename+'\t',
+ tofile=new_filename+'\t',
+ sequencematcher=sequence_matcher)
ud = list(ud)
# work-around for difflib being too smart for its own good
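(For anyone updating callers of internal_diff: the matcher is now chosen
per call, and omitting sequence_matcher gets patience diff. A quick sketch
mirroring what patience-test.py does; the lines here are made up:

    import difflib
    from StringIO import StringIO
    from bzrlib.diff import internal_diff

    out = StringIO()
    # default is now the patience SequenceMatcher
    internal_diff('old', ['a\n', 'b\n'], 'new', ['a\n', 'c\n'], out)
    # passing difflib's matcher explicitly restores the old behaviour
    internal_diff('old', ['a\n', 'b\n'], 'new', ['a\n', 'c\n'], out,
                  sequence_matcher=difflib.SequenceMatcher)
)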
=== modified file 'bzrlib/merge3.py'
--- bzrlib/merge3.py
+++ bzrlib/merge3.py
@@ -19,9 +19,8 @@
# s: "i hate that."
-from difflib import SequenceMatcher
-
from bzrlib.errors import CantReprocessAndShowBase
+from bzrlib.patiencediff import SequenceMatcher
from bzrlib.textfile import check_text_lines
def intersect(ra, rb):
@@ -384,14 +383,8 @@
def find_unconflicted(self):
"""Return a list of ranges in base that are not conflicted."""
-
- import re
-
- # don't sync-up on lines containing only blanks or pounds
- junk_re = re.compile(r'^[ \t#]*$')
-
- am = SequenceMatcher(junk_re.match, self.base, self.a).get_matching_blocks()
- bm = SequenceMatcher(junk_re.match, self.base, self.b).get_matching_blocks()
+ am = SequenceMatcher(None, self.base, self.a).get_matching_blocks()
+ bm = SequenceMatcher(None, self.base, self.b).get_matching_blocks()
unc = []
=== modified file 'bzrlib/tests/test_diff.py'
--- bzrlib/tests/test_diff.py
+++ bzrlib/tests/test_diff.py
@@ -2,7 +2,9 @@
from bzrlib.diff import internal_diff
from bzrlib.errors import BinaryFile
-from bzrlib.tests import TestCase
+from bzrlib.patiencediff import (recurse_matches, SequenceMatcher, unique_lcs,
+ unified_diff, unified_diff_files)
+from bzrlib.tests import TestCase, TestCaseInTempDir
def udiff_lines(old, new, allow_binary=False):
@@ -11,7 +13,9 @@
output.seek(0, 0)
return output.readlines()
+
class TestDiff(TestCase):
+
def test_add_nl(self):
"""diff generates a valid diff for patches that add a newline"""
lines = udiff_lines(['boo'], ['boo\n'])
@@ -56,3 +60,328 @@
self.assertRaises(BinaryFile, udiff_lines, [], [1023 * 'a' + '\x00'])
udiff_lines([1023 * 'a' + '\x00'], [], allow_binary=True)
udiff_lines([], [1023 * 'a' + '\x00'], allow_binary=True)
+
+
+class TestCDVDiffLib(TestCase):
+
+ def test_unique_lcs(self):
+ self.assertEquals(unique_lcs('', ''), [])
+ self.assertEquals(unique_lcs('a', 'a'), [(0,0)])
+ self.assertEquals(unique_lcs('a', 'b'), [])
+ self.assertEquals(unique_lcs('ab', 'ab'), [(0,0), (1,1)])
+ self.assertEquals(unique_lcs('abcde', 'cdeab'), [(2,0), (3,1), (4,2)])
+ self.assertEquals(unique_lcs('cdeab', 'abcde'), [(0,2), (1,3), (2,4)])
+ self.assertEquals(unique_lcs('abXde', 'abYde'), [(0,0), (1,1),
+ (3,3), (4,4)])
+ self.assertEquals(unique_lcs('acbac', 'abc'), [(2,1)])
+
+ def test_recurse_matches(self):
+ def test_one(a, b, matches):
+ test_matches = []
+ recurse_matches(a, b, len(a), len(b), test_matches, 10)
+ self.assertEquals(test_matches, matches)
+
+ test_one(['a', None, 'b', None, 'c'], ['a', 'a', 'b', 'c', 'c'],
+ [(0, 0), (2, 2), (4, 4)])
+ test_one(['a', 'c', 'b', 'a', 'c'], ['a', 'b', 'c'],
+ [(0, 0), (2, 1), (4, 2)])
+
+ # recurse_matches doesn't match non-unique
+ # lines surrounded by bogus text.
+ # The update has been done in patiencediff.SequenceMatcher instead
+
+ # This is what it could be
+ #test_one('aBccDe', 'abccde', [(0,0), (2,2), (3,3), (5,5)])
+
+ # This is what it currently gives:
+ test_one('aBccDe', 'abccde', [(0,0), (5,5)])
+
+ def test_matching_blocks(self):
+ def chk_blocks(a, b, matching):
+ # difflib always adds a signature of the total
+ # length, with no matching entries at the end
+ s = SequenceMatcher(None, a, b)
+ blocks = s.get_matching_blocks()
+ x = blocks.pop()
+ self.assertEquals(x, (len(a), len(b), 0))
+ self.assertEquals(matching, blocks)
+
+ # Some basic matching tests
+ chk_blocks('', '', [])
+ chk_blocks([], [], [])
+ chk_blocks('abcd', 'abcd', [(0, 0, 4)])
+ chk_blocks('abcd', 'abce', [(0, 0, 3)])
+ chk_blocks('eabc', 'abce', [(1, 0, 3)])
+ chk_blocks('eabce', 'abce', [(1, 0, 4)])
+ chk_blocks('abcde', 'abXde', [(0, 0, 2), (3, 3, 2)])
+ chk_blocks('abcde', 'abXYZde', [(0, 0, 2), (3, 5, 2)])
+ chk_blocks('abde', 'abXYZde', [(0, 0, 2), (2, 5, 2)])
+ # This may check too much, but it checks to see that
+ # a copied block stays attached to the previous section,
+ # not the later one.
+ # difflib would tend to grab the trailing longest match
+ # which would make the diff not look right
+ chk_blocks('abcdefghijklmnop', 'abcdefxydefghijklmnop',
+ [(0, 0, 6), (6, 11, 10)])
+
+ # make sure it supports passing in lists
+ chk_blocks(
+ [ 'hello there\n'
+ , 'world\n'
+ , 'how are you today?\n'],
+ [ 'hello there\n'
+ , 'how are you today?\n'],
+ [(0, 0, 1), (2, 1, 1)])
+
+ chk_blocks('aBccDe', 'abccde', [(0,0,1), (2,2,2), (5,5,1)])
+
+ chk_blocks('aBcdEcdFg', 'abcdecdfg', [(0,0,1), (2,2,2),
+ (5,5,2), (8,8,1)])
+
+ chk_blocks('abbabbXd', 'cabbabxd', [(0,1,5), (7,7,1)])
+ chk_blocks('abbabbbb', 'cabbabbc', [(0,1,6)])
+ chk_blocks('bbbbbbbb', 'cbbbbbbc', [(0,1,6)])
+
+ def test_opcodes(self):
+ def chk_ops(a, b, codes):
+ s = SequenceMatcher(None, a, b)
+ self.assertEquals(codes, s.get_opcodes())
+
+ chk_ops('', '', [])
+ chk_ops([], [], [])
+ chk_ops('abcd', 'abcd', [('equal', 0,4, 0,4)])
+ chk_ops('abcd', 'abce', [ ('equal', 0,3, 0,3)
+ , ('replace', 3,4, 3,4)
+ ])
+ chk_ops('eabc', 'abce', [ ('delete', 0,1, 0,0)
+ , ('equal', 1,4, 0,3)
+ , ('insert', 4,4, 3,4)
+ ])
+ chk_ops('eabce', 'abce', [ ('delete', 0,1, 0,0)
+ , ('equal', 1,5, 0,4)
+ ])
+ chk_ops('abcde', 'abXde', [ ('equal', 0,2, 0,2)
+ , ('replace', 2,3, 2,3)
+ , ('equal', 3,5, 3,5)
+ ])
+ chk_ops('abcde', 'abXYZde', [ ('equal', 0,2, 0,2)
+ , ('replace', 2,3, 2,5)
+ , ('equal', 3,5, 5,7)
+ ])
+ chk_ops('abde', 'abXYZde', [ ('equal', 0,2, 0,2)
+ , ('insert', 2,2, 2,5)
+ , ('equal', 2,4, 5,7)
+ ])
+ chk_ops('abcdefghijklmnop', 'abcdefxydefghijklmnop',
+ [ ('equal', 0,6, 0,6)
+ , ('insert', 6,6, 6,11)
+ , ('equal', 6,16, 11,21)
+ ])
+ chk_ops(
+ [ 'hello there\n'
+ , 'world\n'
+ , 'how are you today?\n'],
+ [ 'hello there\n'
+ , 'how are you today?\n'],
+ [ ('equal', 0,1, 0,1)
+ , ('delete', 1,2, 1,1)
+ , ('equal', 2,3, 1,2)
+ ])
+ chk_ops('aBccDe', 'abccde',
+ [ ('equal', 0,1, 0,1)
+ , ('replace', 1,2, 1,2)
+ , ('equal', 2,4, 2,4)
+ , ('replace', 4,5, 4,5)
+ , ('equal', 5,6, 5,6)
+ ])
+ chk_ops('aBcdEcdFg', 'abcdecdfg',
+ [ ('equal', 0,1, 0,1)
+ , ('replace', 1,2, 1,2)
+ , ('equal', 2,4, 2,4)
+ , ('replace', 4,5, 4,5)
+ , ('equal', 5,7, 5,7)
+ , ('replace', 7,8, 7,8)
+ , ('equal', 8,9, 8,9)
+ ])
+
+ def test_multiple_ranges(self):
+ # There was an earlier bug where we used a bad set of ranges,
+ # this triggers that specific bug, to make sure it doesn't regress
+ def chk_blocks(a, b, matching):
+ # difflib always adds a signature of the total
+ # length, with no matching entries at the end
+ s = SequenceMatcher(None, a, b)
+ blocks = s.get_matching_blocks()
+ x = blocks.pop()
+ self.assertEquals(x, (len(a), len(b), 0))
+ self.assertEquals(matching, blocks)
+
+ chk_blocks('abcdefghijklmnop'
+ , 'abcXghiYZQRSTUVWXYZijklmnop'
+ , [(0, 0, 3), (6, 4, 3), (9, 20, 7)])
+
+ chk_blocks('ABCd efghIjk L'
+ , 'AxyzBCn mo pqrstuvwI1 2 L'
+ , [(0,0,1), (1, 4, 2), (4, 7, 1), (9, 19, 1), (12, 23, 3)])
+
Is this the best way to write it? Basically, I needed a multi-line code
snippet that I could diff, to make sure it gave a reasonable answer. (It
also highlights the difference between difflib and patiencediff.)
I suppose we could move the static text out of the function and into
module scope; it might make it look a little cleaner. I don't really know
what the 'best' thing here is.
+ chk_blocks('''\
+ get added when you add a file in the directory.
+ """
+ takes_args = ['file*']
+ takes_options = ['no-recurse']
+
+ def run(self, file_list, no_recurse=False):
+ from bzrlib.add import smart_add, add_reporter_print, add_reporter_null
+ if is_quiet():
+ reporter = add_reporter_null
+ else:
+ reporter = add_reporter_print
+ smart_add(file_list, not no_recurse, reporter)
+
+
+class cmd_mkdir(Command):
+'''.splitlines(True)
+, '''\
+ get added when you add a file in the directory.
+
+ --dry-run will show which files would be added, but not actually
+ add them.
+ """
+ takes_args = ['file*']
+ takes_options = ['no-recurse', 'dry-run']
+
+ def run(self, file_list, no_recurse=False, dry_run=False):
+ import bzrlib.add
+
+ if dry_run:
+ if is_quiet():
+ # This is pointless, but I'd rather not raise an error
+ action = bzrlib.add.add_action_null
+ else:
+ action = bzrlib.add.add_action_print
+ elif is_quiet():
+ action = bzrlib.add.add_action_add
+ else:
+ action = bzrlib.add.add_action_add_and_print
+
+ bzrlib.add.smart_add(file_list, not no_recurse, action)
+
+
+class cmd_mkdir(Command):
+'''.splitlines(True)
+, [(0,0,1), (1, 4, 2), (9, 19, 1), (12, 23, 3)])
+
+ def test_cdv_unified_diff(self):
+ txt_a = [ 'hello there\n'
+ , 'world\n'
+ , 'how are you today?\n']
+ txt_b = [ 'hello there\n'
+ , 'how are you today?\n']
+ self.assertEquals([ '--- \n'
+ , '+++ \n'
+ , '@@ -1,3 +1,2 @@\n'
+ , ' hello there\n'
+ , '-world\n'
+ , ' how are you today?\n'
+ ]
+ , list(unified_diff(txt_a, txt_b
+ , sequencematcher=SequenceMatcher)))
+ txt_a = map(lambda x: x+'\n', 'abcdefghijklmnop')
+ txt_b = map(lambda x: x+'\n', 'abcdefxydefghijklmnop')
+ # This is the result with LongestCommonSubstring matching
+ self.assertEquals([ '--- \n'
+ , '+++ \n'
+ , '@@ -1,6 +1,11 @@\n'
+ , ' a\n'
+ , ' b\n'
+ , ' c\n'
+ , '+d\n'
+ , '+e\n'
+ , '+f\n'
+ , '+x\n'
+ , '+y\n'
+ , ' d\n'
+ , ' e\n'
+ , ' f\n']
+ , list(unified_diff(txt_a, txt_b)))
+ # And the cdv diff
+ self.assertEquals([ '--- \n'
+ , '+++ \n'
+ , '@@ -4,6 +4,11 @@\n'
+ , ' d\n'
+ , ' e\n'
+ , ' f\n'
+ , '+x\n'
+ , '+y\n'
+ , '+d\n'
+ , '+e\n'
+ , '+f\n'
+ , ' g\n'
+ , ' h\n'
+ , ' i\n'
+ ]
+ , list(unified_diff(txt_a, txt_b,
+ sequencematcher=SequenceMatcher)))
+
+
+class TestCDVDiffLibFiles(TestCaseInTempDir):
+
+ def test_cdv_unified_diff_files(self):
+ txt_a = [ 'hello there\n'
+ , 'world\n'
+ , 'how are you today?\n']
+ txt_b = [ 'hello there\n'
+ , 'how are you today?\n']
+ open('a1', 'wb').writelines(txt_a)
+ open('b1', 'wb').writelines(txt_b)
+
+ self.assertEquals([ '--- a1 \n'
+ , '+++ b1 \n'
+ , '@@ -1,3 +1,2 @@\n'
+ , ' hello there\n'
+ , '-world\n'
+ , ' how are you today?\n'
+ ]
+ , list(unified_diff_files('a1', 'b1',
+ sequencematcher=SequenceMatcher)))
+
+ txt_a = map(lambda x: x+'\n', 'abcdefghijklmnop')
+ txt_b = map(lambda x: x+'\n', 'abcdefxydefghijklmnop')
+ open('a2', 'wb').writelines(txt_a)
+ open('b2', 'wb').writelines(txt_b)
+
+ # This is the result with LongestCommonSubstring matching
+ self.assertEquals([ '--- a2 \n'
+ , '+++ b2 \n'
+ , '@@ -1,6 +1,11 @@\n'
+ , ' a\n'
+ , ' b\n'
+ , ' c\n'
+ , '+d\n'
+ , '+e\n'
+ , '+f\n'
+ , '+x\n'
+ , '+y\n'
+ , ' d\n'
+ , ' e\n'
+ , ' f\n']
+ , list(unified_diff_files('a2', 'b2')))
+
+ # And the cdv diff
+ self.assertEquals([ '--- a2 \n'
+ , '+++ b2 \n'
+ , '@@ -4,6 +4,11 @@\n'
+ , ' d\n'
+ , ' e\n'
+ , ' f\n'
+ , '+x\n'
+ , '+y\n'
+ , '+d\n'
+ , '+e\n'
+ , '+f\n'
+ , ' g\n'
+ , ' h\n'
+ , ' i\n'
+ ]
+ , list(unified_diff_files('a2', 'b2',
+ sequencematcher=SequenceMatcher)))
=== modified file 'bzrlib/weave.py' (properties changed)
--- bzrlib/weave.py
+++ bzrlib/weave.py
@@ -84,6 +84,7 @@
)
import bzrlib.errors as errors
from bzrlib.osutils import sha_strings
+from bzrlib.patiencediff import SequenceMatcher, unified_diff
from bzrlib.symbol_versioning import *
from bzrlib.tsort import topo_sort
from bzrlib.versionedfile import VersionedFile, InterVersionedFile
@@ -180,9 +181,9 @@
"""
__slots__ = ['_weave', '_parents', '_sha1s', '_names', '_name_map',
- '_weave_name']
+ '_weave_name', '_matcher']
- def __init__(self, weave_name=None, access_mode='w'):
+ def __init__(self, weave_name=None, access_mode='w', matcher=None):
super(Weave, self).__init__(access_mode)
self._weave = []
self._parents = []
@@ -190,6 +191,10 @@
self._names = []
self._name_map = {}
self._weave_name = weave_name
+ if matcher is None:
+ self._matcher = SequenceMatcher
+ else:
+ self._matcher = matcher
def __repr__(self):
return "Weave(%r)" % self._weave_name
@@ -528,7 +533,7 @@
#print 'basis_lines:', basis_lines
#print 'new_lines: ', lines
- s = SequenceMatcher(None, basis_lines, lines)
+ s = self._matcher(None, basis_lines, lines)
# offset gives the number of lines that have been inserted
# into the weave up to the current point; if the original edit instruction
@@ -1356,7 +1361,6 @@
sys.stdout.writelines(w.get_iter(int(argv[3])))
elif cmd == 'diff':
- from difflib import unified_diff
w = readit()
fn = argv[2]
v1, v2 = map(int, argv[3:5])
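(And on the weave change: if I'm reading the new constructor right,
existing callers are untouched, and you can still opt back into difflib per
weave; 'test' here is just a made-up name:

    import difflib
    from bzrlib.weave import Weave

    w = Weave(weave_name='test')            # patience matcher by default
    w_old = Weave(weave_name='test', matcher=difflib.SequenceMatcher)
)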
Other than these few things, I think it's just about ready to be merged. So
+1 from me once they are handled.
John
=:->