[MERGE] LCA Merge resolution for tree-shape

Sat Sep 6 10:22:31 BST 2008

I thought I'd review this too in case that helped it get into 1.7, as it's
already been in review for a while.

I have read some but not all tests.  The merge code and documentation of it is
actually pretty clear.
>
>>>                       ensure_null(self.other_basis)]
>>>          if NULL_REVISION in revisions:
>>>              self.base_rev_id = NULL_REVISION
>>> +            self.base_tree = self.revision_tree(self.base_rev_id)
>>>          else:
>>> -            self.base_rev_id, steps = self.revision_graph.find_unique_lca(
>>> -                revisions[0], revisions[1], count_steps=True)
>>> +            lcas = self.revision_graph.find_lca(revisions[0], revisions[1])
>>> +            if len(lcas) == 0:
>>> +                self.base_rev_id = NULL_REVISION
>>> +            elif len(lcas) == 1:
>>> +                self.base_rev_id = list(lcas)[0]
>>> +            else: # len(lcas) > 1
>>
>> ^^^ What is this commented-out code for?
>
> Just documenting the case which is being handled in the 'else'
> statement. It isn't commented-out code. It is a code comment. I can
> remove it. Probably it evolved from when the sections were a bit longer,
> and it may have been a bit more unclear about what was going on.

I think it's fine.

>>> +        base_inventory = self.base_tree.inventory
>>> +        this_inventory = self.this_tree.inventory
>>> +        for path, file_id, other_ie, lca_values in walker.iter_all():
>>> +            # Is this modified at all from any of the other trees?
>>> +            if other_ie is None:
>>> +                other_ie = _none_entry
>>> +            if interesting_ids is not None and file_id not in interesting_ids:
>>> +                continue
>>> +
>>> +            # If other_revision is found in any of the lcas, that means this
>>> +            # node is uninteresting. This is because when merging, if there are
>>> +            # multiple heads(), we have to create a new node. So if we didn't,
>>> +            # we know that the ancestry is linear, and that OTHER did not
>>> +            # modify anything
>>> +            # See doc/developers/lca_merge_resolution.txt for details
>>> +            other_revision = other_ie.revision
>>> +            if other_revision is not None:
>>> +                # We can't use this shortcut when other_revision is None,
>>> +                # because it may be None because things are WorkingTrees, and
>>> +                # not because it is *actually* None.
>>
>> ^^^ When would it *actually* be none unless the trees were workingtrees?
>>  It is required to have a value for RevisionTrees.  _none_entry just
>> makes this logic more confused.
>
> It is None when the entry is deleted. And it makes the rest of the logic
> *a lot* more simple, because I don't have to put if statements all over
> the rest of the code when extracting variables. (text_sha1, text_size,
> revision, parent_id, executable). Instead we have 1 if seeing if the
> entry is available, and the rest just use None, which they would use anyway.

I guess you could add a comment about this.  In general having None
meaning not present on one side is not surprising from deltas etc.

>
> ...
>
>>> +            kind_winner = Merge3Merger._lca_multi_way(
>>> +                (base_ie.kind, lca_kinds),
>>> +                other_ie.kind, this_ie.kind)
>>
>> Why are we doing this?  That sort of thing should be done in merge_contents.
>
> Because we have to know if "file content changed or kind changed". It is
> one of the required return fields.
>
>>
>> When invoking methods on your own class, I think it's best to do it in a
>> generic way, such that subclasses can override those methods.
>>
>> So I'd rather see self._lca_multi_way or perhaps
>> self.__class__._lca_multi_way.  This is the way existing code is written.
>
> Well, inside static-methods you don't have a 'self', but other than that
> I can do so. I think bzrlib itself is a bit fragmented in how it
> approaches this. I think Robert has mentioned that class-level functions
> should be accessed via the class, to avoid confusing where the 'self'
> value is coming from.
>
> I'll switch the ones I can, though.

I agree bzrlib is a bit fragmented here, reflecting evolving opinions
about what works well in Python.

In general I feel it works best to have things be instance methods even if
they don't depend on the state of the particular instance, rather than
making the caller go via __class__ or specify a particular class name.
The exception is things like factory methods that are used before an
instance exists.


>>> +
>>> +    def get_builder(self):
>>> +        builder = self.make_branch_builder('path')
>>> +        builder.start_series()
>>> +        self.addCleanup(builder.finish_series)
>>> +        return builder
>>> +
>>> +    def setup_simple_graph(self):
>>> +        """Create a simple 3-node graph.
>>> +                A
>>> +                |\\
>>> +                B C
>>
>> Do you mean to have two slashes here?  Or did you mean
>>
>>                 A
>>                 |\
>>                 B C
>
> It is a python string, so a single '\' is escaping the next newline
> character. I'll turn it into a comment where it won't cause problems.

Or a raw multiline string should work

 r""" A \ B """

=== modified file 'bzrlib/memorytree.py'

--- bzrlib/memorytree.py	2008-09-02 00:29:28 +0000
+++ bzrlib/memorytree.py	2008-09-05 03:11:40 +0000
@@ -21,10 +21,12 @@


 from copy import deepcopy
+import os

 from bzrlib import (
     errors,
     mutabletree,
+    osutils,
     revision as _mod_revision,
     )
 from bzrlib.decorators import needs_read_lock, needs_write_lock
@@ -99,6 +101,14 @@
             return None, False, None
         return entry.kind, entry.executable, None

+    @needs_tree_write_lock
+    def rename_one(self, from_rel, to_rel):
+        file_id = self.path2id(from_rel)
+        to_dir, to_tail = os.path.split(to_rel)
+        to_parent_id = self._inventory[file_id].parent_id
+        self._file_transport.move(from_rel, to_rel)
+        self._inventory.rename(file_id, to_parent_id, to_tail)
+
     def path_content_summary(self, path):
         """See Tree.path_content_summary."""
         id = self.path2id(path)

Strictly I think this should be using urlutils.split not os.path.split, as
it shouldn't vary across platforms?



=== modified file 'bzrlib/merge.py'
--- bzrlib/merge.py	2008-08-27 20:32:22 +0000
+++ bzrlib/merge.py	2008-09-05 03:11:40 +0000
@@ -28,6 +28,7 @@
     patiencediff,
     registry,
     revision as _mod_revision,
+    tree as _mod_tree,
     tsort,
     )
 from bzrlib.branch import Branch
@@ -96,6 +97,8 @@
         self._revision_graph = revision_graph
         self._base_is_ancestor = None
         self._base_is_other_ancestor = None
+        self._is_criss_cross = None
+        self._lca_trees = None

     @property
     def revision_graph(self):
@@ -353,15 +356,45 @@
                      ensure_null(self.other_basis)]
         if NULL_REVISION in revisions:
             self.base_rev_id = NULL_REVISION
+            self.base_tree = self.revision_tree(self.base_rev_id)
+            self._is_criss_cross = False
         else:
-            self.base_rev_id, steps = self.revision_graph.find_unique_lca(
-                revisions[0], revisions[1], count_steps=True)
+            lcas = self.revision_graph.find_lca(revisions[0], revisions[1])
+            self._is_criss_cross = False
+            if len(lcas) == 0:
+                self.base_rev_id = NULL_REVISION
+            elif len(lcas) == 1:
+                self.base_rev_id = list(lcas)[0]
+            else: # len(lcas) > 1
+                if len(lcas) > 2:
+                    # find_unique_lca can only handle 2 nodes, so we have to
+                    # start back at the beginning. It is a shame to traverse
+                    # the graph again, but better than re-implementing
+                    # find_unique_lca.
+                    self.base_rev_id = self.revision_graph.find_unique_lca(
+                                            revisions[0], revisions[1])
+                else:
+                    self.base_rev_id = self.revision_graph.find_unique_lca(
+                                            *lcas)
+                self._is_criss_cross = True
             if self.base_rev_id == NULL_REVISION:
                 raise UnrelatedBranches()
-            if steps > 1:
+            if self._is_criss_cross:
                 warning('Warning: criss-cross merge encountered.  See bzr'
                         ' help criss-cross.')

Should we update 'help criss-cross' to reflect the improvements from this?

+class _InventoryNoneEntry(object):
+    """This represents an inventory entry which *isn't there*.
+
+    It simplifies the merging logic if we always have an InventoryEntry, even
+    if it isn't actually present
+    """
+    executable = None
+    kind = None
+    name = None
+    parent_id = None
+    revision = None
+    symlink_target = None
+    text_sha1 = None
+
+_none_entry = _InventoryNoneEntry()
+
+

This seems nice, maybe in future it should be a standard part of
Inventory.  (But maybe we want to change it to just use tuples or some
lighter object.)

 class Merge3Merger(object):
     """Three-way merger that uses the merge3 text merger"""
     requires_base = True
@@ -463,12 +516,13 @@
     supports_cherrypick = True
     supports_reverse_cherrypick = True
     winner_idx = {"this": 2, "other": 1, "conflict": 1}
+    supports_lca_trees = True

Could you please add some docstring about the effect of
supports_lca_trees.  There may be no obvious place describing the Merger
interface, if there's no base class at present.

@@ -614,6 +682,177 @@
             result.append((file_id, changed, parents3, names3, executable3))
         return result

+    def _entries_lca(self):
+        """Gather data about files modified between multiple trees.
+
+        This compares OTHER versus all LCA trees, and for interesting entries,
+        it then compares with THIS and BASE.
+
+        For the multi-valued entries, the format will be (BASE, [lca1, lca2])

I guess this is for the parents, names, and executable bit.  The "name,
in, lcas" confused me a bit.

It'd be worth saying this is the parent directory.

How about

  parents: the parent directory file id in each tree, as
    ((base_parent, [lca_parent, ...]), other_parent, this_parent)

+        :return: [(file_id, changed, parents, names, executable)]
+            file_id     Simple file_id of the entry
+            changed     Boolean, True if the kind or contents changed
+                        else False
+            parents     ((base, [parent_id, in, lcas]), parent_id_other,
+                         parent_id_this)
+            names       ((base, [name, in, lcas]), name_in_other, name_in_this)
+            executable  ((base, [exec, in, lcas]), exec_in_other, exec_in_this)
+        """
+        if self.interesting_files is not None:
+            lookup_trees = [self.this_tree, self.base_tree]
+            lookup_trees.extend(self._lca_trees)
+            # I think we should include the lca trees as well
+            interesting_ids = self.other_tree.paths2ids(self.interesting_files,
+                                                        lookup_trees)
+        else:
+            interesting_ids = self.interesting_ids
+        result = []
+        walker = _mod_tree.MultiWalker(self.other_tree, self._lca_trees)
+
+        base_inventory = self.base_tree.inventory
+        this_inventory = self.this_tree.inventory
+        for path, file_id, other_ie, lca_values in walker.iter_all():
+            # Is this modified at all from any of the other trees?
+            if other_ie is None:
+                other_ie = _none_entry
+            if interesting_ids is not None and file_id not in interesting_ids:
+                continue
+
+            # If other_revision is found in any of the lcas, that means this
+            # node is uninteresting. This is because when merging, if there are
+            # multiple heads(), we have to create a new node. So if we didn't,
+            # we know that the ancestry is linear, and that OTHER did not
+            # modify anything
+            # See doc/developers/lca_merge_resolution.txt for details
+            other_revision = other_ie.revision
+            if other_revision is not None:
+                # We can't use this shortcut when other_revision is None,
+                # because it may be None because things are WorkingTrees, and
+                # not because it is *actually* None.
+                is_unmodified = False
+                for lca_path, ie in lca_values:
+                    if ie is not None and ie.revision == other_revision:
+                        is_unmodified = True
+                        break
+                if is_unmodified:
+                    continue
+
+            lca_entries = []
+            for lca_path, lca_ie in lca_values:
+                if lca_ie is None:
+                    lca_entries.append(_none_entry)
+                else:
+                    lca_entries.append(lca_ie)
+
+            if file_id in base_inventory:
+                base_ie = base_inventory[file_id]
+            else:
+                base_ie = _none_entry
+
+            if file_id in this_inventory:
+                this_ie = this_inventory[file_id]
+            else:
+                this_ie = _none_entry
+
+            lca_kinds = []
+            lca_parent_ids = []
+            lca_names = []
+            lca_executable = []
+            for lca_ie in lca_entries:
+                lca_kinds.append(lca_ie.kind)
+                lca_parent_ids.append(lca_ie.parent_id)
+                lca_names.append(lca_ie.name)
+                lca_executable.append(lca_ie.executable)
+
+            kind_winner = self._lca_multi_way(
+                (base_ie.kind, lca_kinds),
+                other_ie.kind, this_ie.kind)
+            parent_id_winner = self._lca_multi_way(
+                (base_ie.parent_id, lca_parent_ids),
+                other_ie.parent_id, this_ie.parent_id)
+            name_winner = self._lca_multi_way(
+                (base_ie.name, lca_names),
+                other_ie.name, this_ie.name)
+
+            content_changed = True
+            if kind_winner == 'this':
+                # No kind change in OTHER, see if there are *any* changes
+                if other_ie.kind == None:
+                    # No content and 'this' wins the kind, so skip this?
+                    # continue
+                    pass
+                elif other_ie.kind == 'directory':
+                    if parent_id_winner == 'this' and name_winner == 'this':
+                        # No change for this directory in OTHER, skip
+                        continue
+                    content_changed = False
+                elif other_ie.kind == 'file':
+                    def get_sha1(ie, tree):
+                        if ie.kind != 'file':
+                            return None
+                        return tree.get_file_sha1(file_id)
+                    base_sha1 = get_sha1(base_ie, self.base_tree)
+                    lca_sha1s = [get_sha1(ie, tree) for ie, tree
+                                 in zip(lca_entries, self._lca_trees)]
+                    this_sha1 = get_sha1(this_ie, self.this_tree)
+                    other_sha1 = get_sha1(other_ie, self.other_tree)
+                    sha1_winner = self._lca_multi_way(
+                        (base_sha1, lca_sha1s), other_sha1, this_sha1,
+                        allow_overriding_lca=False)
+                    exec_winner = self._lca_multi_way(
+                        (base_ie.executable, lca_executable),
+                        other_ie.executable, this_ie.executable)
+                    if (parent_id_winner == 'this' and name_winner == 'this'
+                        and sha1_winner == 'this' and exec_winner == 'this'):
+                        # No kind, parent, name, exec, or content change for
+                        # OTHER, so this node is not considered interesting
+                        continue
+                    if sha1_winner == 'this':
+                        content_changed = False
+                elif other_ie.kind == 'symlink':
+                    def get_target(ie, tree):
+                        if ie.kind != 'symlink':
+                            return None
+                        return tree.get_symlink_target(file_id)
+                    base_target = get_target(base_ie, self.base_tree)
+                    lca_targets = [get_target(ie, tree) for ie, tree
+                                   in zip(lca_entries, self._lca_trees)]
+                    this_target = get_target(this_ie, self.this_tree)
+                    other_target = get_target(other_ie, self.other_tree)
+                    target_winner = self._lca_multi_way(
+                        (base_target, lca_targets),
+                        other_target, this_target)
+                    if (parent_id_winner == 'this' and name_winner == 'this'
+                        and target_winner == 'this'):
+                        # No kind, parent, name, or symlink target change
+                        # not interesting
+                        continue
+                    if target_winner == 'this':
+                        content_changed = False
+                elif other_ie.kind == 'tree-reference':
+                    # The 'changed' information seems to be handled at a higher
+                    # level. At least, _entries3 returns False for content
+                    # changed, even when at a new revision_id.
+                    content_changed = False
+                    if (parent_id_winner == 'this' and name_winner == 'this'):
+                        # Nothing interesting
+                        continue
+                else:
+                    raise AssertionError('unhandled kind: %s' % other_ie.kind)
+                # XXX: We need to handle kind == 'symlink'

^^ it looks like you did?

So this is pretty long obviously, but clear enough for now.

+
+            # If we have gotten this far, that means something has changed
+            result.append((file_id, content_changed,
+                           ((base_ie.parent_id, lca_parent_ids),
+                            other_ie.parent_id, this_ie.parent_id),
+                           ((base_ie.name, lca_names),
+                            other_ie.name, this_ie.name),
+                           ((base_ie.executable, lca_executable),
+                            other_ie.executable, this_ie.executable)
+                          ))
+        return result
+
+
     def fix_root(self):
         try:
             self.tt.final_kind(self.tt.root)
@@ -708,6 +947,56 @@
             return "other"

     @staticmethod
+    def _lca_multi_way(bases, other, this, allow_overriding_lca=True):
+        """Consider LCAs when determining whether a change has occurred.
+
+        If LCAS are all identical, this is the same as a _three_way comparison.
+
+        :param bases: value in (BASE, [LCAS])
+        :param other: value in OTHER
+        :param this: value in THIS
+        :param allow_overriding_lca: If there is more than one unique lca
+            value, allow OTHER to override THIS if it has a new value, and
+            THIS only has an lca value, or vice versa. This is appropriate for
+            truly scalar values, not as much for non-scalars.
+        :return: 'this', 'other', or 'conflict' depending on whether an entry
+            changed or not.
+        """
+        # See doc/developers/lca_merge_resolution.txt for details about this
+        # algorithm.
+        if other == this:
+            # Either Ambiguously clean, or nothing was actually changed. We
+            # don't really care
+            return 'this'
+        base_val, lca_vals = bases
+        # Remove 'base_val' from the lca_vals, because it is not interesting
+        filtered_lca_vals = [lca_val for lca_val in lca_vals
+                                      if lca_val != base_val]
+        if len(filtered_lca_vals) == 0:
+            return Merge3Merger._three_way(base_val, other, this)
+
+        unique_lca_vals = set(filtered_lca_vals)


(trivial) You might as well just compute it directly

  unique_lca_vals = set(lca_vals).difference([base_val])
  if not unique_lca_vals:
    ..

=== modified file 'bzrlib/tests/test_memorytree.py'
--- bzrlib/tests/test_memorytree.py	2008-07-22 18:12:25 +0000
+++ bzrlib/tests/test_memorytree.py	2008-07-23 02:35:09 +0000
+    def test_rename_file_to_subdir(self):
+        tree = self.make_branch_and_memory_tree('branch')
+        tree.lock_write()
+        self.addCleanup(tree.unlock)
+        tree.add('')
+        tree.mkdir('subdir', 'subdir-id')
+        tree.add('foo', 'foo-id', 'file')
+        tree.put_file_bytes_non_atomic('foo-id', 'content\n')
+        tree.commit('one', rev_id='rev-one')
+
+        tree.rename_one('foo', 'subdir/bar')
+        self.assertEqual('subdir/bar', tree.id2path('foo-id'))
+        self.assertEqual('content\n',
+                         tree._file_transport.get_bytes('subdir/bar'))
+        tree.commit('two', rev_id='rev-two')
+        rev_tree2 = tree.branch.repository.revision_tree('rev-two')
+        self.assertEqual('subdir/bar', rev_tree2.id2path('foo-id'))

I'd be interested to see if rename('file', 'subdir') and rename('file',
'subdir/') work as you'd expect; I think they may not, but that's not
necessary for this patch.  Maybe you could add an XXX here.

=== added file 'doc/developers/lca_tree_merging.txt'
--- doc/developers/lca_tree_merging.txt	1970-01-01 00:00:00 +0000
+++ doc/developers/lca_tree_merging.txt	2008-09-05 02:29:34 +0000
@@ -0,0 +1,126 @@
+================
+LCA Tree Merging
+================
+
+There are 2 ways that you get LCA merge resolution in bzr. First, if you use
+``bzr merge --lca``, the *content* of files will be resolved using a
Least Common
+Ancestors algorithm. That is described in <lca-merge.html> not here.
+
+This document describes how we handle merging tree-shape when there is not
+a single unique ancestor (criss-cross merge). With a single LCA, we use
+simple 3-way-merge logic.
+
+When there are multiple possible LCAs, we use a different algorithm for
+handling tree-shape merging. Described here.

This is a really nice description of the algorithm, thanks for adding it.

In this last paragraph, just join them up into "merging, described here".


+
+As a simple example, here is a revision graph which we will refer to often::
+
+  .    BASE
+  .  /      \
+  . LCA1   LCA2
+  . |   \ /   |
+  . |    X    |
+  . |   / \   |
+  . THIS  OTHER
+
+In this graph, ``THIS`` and ``OTHER`` both have ``LCA1`` and ``LCA2`` in their
+ancestry but neither is an ancestor of the other, so we have 2 least common
+ancestors. The unique common ancestor is ``BASE``. (It should be noted that in
+this text we will talk directly about ``LCA1`` and ``LCA2``, but the algorithms
+are designed to cope with more than 2 LCAs.)
+
+
+Scalars
+=======
+
+Definition
+----------
+
+I'm defining scalar values as ones that cannot be 'merged' on their own. For
+example, the name of a file is "scalar". If one person changes "foo.txt" to
+"foo.c" and someone else changes "foo.txt" to "bar.txt" we don't merge the
+changes to be "bar.c", we simply conflict and expect the user to sort it out.
+
+We use a slightly different algorithm for scalars.

It'd be useful to describe here what scalars you're trying to merge.
Obviously not the file text.  For instance could you merge the file's name
from one ancestor and the directory from another?


+
+
+Resolution Algorithm
+--------------------
+
+(This can be seen as ``bzrlib.merge.Merge3Merger._lca_multi_way```
+
+1. If ``THIS`` and ``OTHER`` have the same value, use it. There is no need to
+   inspect any other values in this case. Either nothing was changed (all
+   interesting nodes would have the same value), or we have "accidental
+   convergence" (both sides made the same change.).
+
+2. Find the values from ``LCA1`` and ``LCA2`` which are not the same as
+   ``BASE``. The idea here is to provide a rudimentary "heads" comparison.
+   Often, the whole tree graph will have a criss-cross, but the per-file
+   (per-scalar) graph would be linear, and the value in one LCA strictly
+   dominates the other. It is possible to construct a scenario where one side
+   dominates the other, but the dominated value is not ``BASE``, but a second
+   intermediate value. Most scalars are rarely changed, so this is unlikely to
+   be an issue. The trade-off is having to generate and inspect the
+   per-scalar graph.
+
+   If there are no LCA values that are different from ``BASE``, we use a simple
+   3-way merge with ``BASE`` as the base value.
+
+3. Find the unique set of LCA values that do not include the ``BASE`` value.
+   If there is only one unique LCA value, we again use three-way merge logic
+   using that unique value as the base.
+
+4. At this point, we have determined that we have at least 2 unique values in
+   our LCAs which means that ``THIS`` and ``OTHER`` would both have to resolve
+   the conflict. If they resolved it in the same way, we would have caught that
+   in step 1. So they either both picked a different LCA value, or one (or
+   both) chose a new value to use.
+
+   If ``OTHER`` and ``THIS`` both picked a different LCA value, we conflict.
+
+   If ``OTHER`` and ``THIS`` both have values that are not LCA values, we also
+   conflict. (Same as 3-way, both sides modified a value in different ways.)
+
+5. (optional) The only tricky part is this: if ``OTHER`` has a LCA value, but
+   ``THIS`` does not, then we go with ``THIS``, and conversely if ``THIS`` has
+   an LCA value, but ``OTHER`` does not, then we go with ``OTHER``. The idea is
+   that ``THIS`` and ``OTHER`` may have resolved things in the same way, and
+   then later changed the value to something newer. (They could have also
+   resolved it differently, and then one side updated again.)

I wonder what option turns this on.  Is it something that's not yet
implemented, or is it accessible to the user as an option?

-- 
Martin <http://launchpad.net/~mbp/>