[MERGE] LCA Merge resolution for tree-shape

Fri Sep 5 04:18:06 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Interestingly enough, I just sent this a few minutes ago, and I ended up
getting bit by a "criss-cross" merge. Some of the history of this branch
was already merged into bzr.dev, which caused the BASE selection to go
very ancient. The 'send' output was 1.1MB in size, rather than 150kB.
Interestingly, the easiest way to 'fix' send is to just merge in
bzr.dev. (Otherwise I think you could specify exactly the right base
revision to get the diff right, but it could easily include too much in
the bundle data.)

Aaron Bentley wrote:
> I haven't finished going over the test cases, but I've seen enough to
> know there's work needed.
>
> bb:resubmit
>
> John Arbash Meinel wrote:
>> I know 2k line patches are not fun to review. But if it makes you feel better,
>> only about 600 lines are actual content changes, the rest are the tests making
>> sure the content is correct. :)
>
> The tests are, if anything, harder to read than the implementation.

I've tried to add a bit more comments to the tests, but honestly, the
tests are complex because they are testing genuine edge cases. Which can
easily correspond to broken code.

>
>> === modified file 'bzrlib/memorytree.py'
>> --- bzrlib/memorytree.py	2008-07-22 18:12:25 +0000
>> +++ bzrlib/memorytree.py	2008-07-23 02:35:09 +0000
>> @@ -101,6 +103,14 @@
>>              return None, False, None
>>          return entry.kind, entry.executable, None
>
>> +    @needs_tree_write_lock
>> +    def rename_one(self, from_rel, to_rel):
>> +        file_id = self.path2id(from_rel)
>> +        to_dir, to_tail = os.path.split(to_rel)
>> +        to_parent_id = self.path2id(to_dir)
>
> It seems a bit strange to implement parent_id calculation here.  Since
> you know you're working with an inventory, why not use
> Inventory[file_id].parent?

I can just use self._inventory[file_id].parent_id

> I was actually hoping that in the future, rename functionality would be
> implemented in terms of apply_inventory_delta.

I seem to recall trying apply_inventory_delta first, but it actually
meant implementing a lot more of the api, because
MT.apply_inventory_delta wants to do stuff like "_write_inventory()"
which doesn't make sense for a MemoryTree.
Also, at some point you have to move the things on disk, which is a MT
function, not an inventory one.

>>          self._revision_graph = revision_graph
>>          self._base_is_ancestor = None
>>          self._base_is_other_ancestor = None
>> +        self._is_criss_cross = False
>
> ^^^ None is a better default than False when no value has been
> determined.  It also matches the surrounding code better.
>

I suppose. There is only one place that can set it to True, and several
paths that should set it to False. I went ahead and did it, and then
just default it to False in 'find_base'.

>>                       ensure_null(self.other_basis)]
>>          if NULL_REVISION in revisions:
>>              self.base_rev_id = NULL_REVISION
>> +            self.base_tree = self.revision_tree(self.base_rev_id)
>>          else:
>> -            self.base_rev_id, steps = self.revision_graph.find_unique_lca(
>> -                revisions[0], revisions[1], count_steps=True)
>> +            lcas = self.revision_graph.find_lca(revisions[0], revisions[1])
>> +            if len(lcas) == 0:
>> +                self.base_rev_id = NULL_REVISION
>> +            elif len(lcas) == 1:
>> +                self.base_rev_id = list(lcas)[0]
>> +            else: # len(lcas) > 1
>
> ^^^ What is this commented-out code for?

Just documenting the case which is being handled in the 'else'
statement. It isn't commented-out code. It is a code comment. I can
remove it. Probably it evolved from when the sections were a bit longer,
and it may have been a bit more unclear about what was going on.

>
>> +                if len(lcas) > 2:
>> +                    # find_unique_lca can only handle 2 nodes, so we have to
>> +                    # start back at the beginning. It is a shame to traverse
>> +                    # the graph again, but better than re-implementing
>> +                    # find_unique_lca.
>
> I think you've identified an API failure.  We should fix it, and I don't
> think it would be hard.

Actually, it would. Mostly because you already updated the api in a
"backward compatible" way:
def find_unique_lca(self, left_revision, right_revision,
                    count_steps=False):

So we can't take an unlimited number of arguments, because we have a
keyword argument. Unless we went the:

def find_unique_lca(self, *args, **kwargs):
  count_steps = kwargs.get('count_steps')

However, I'm pretty sure you already use the api as:
  find_unique_lca(left, right, True)

rather than
  find_unique_lca(left, right, count_steps=True)

So the only api compatible way of doing it is to write a different
function, and thunk over to it.

>
> One option is to update the Graph API so that Graph.find_unique_lca
> takes an unlimited number of *args.  AFAIK, there's only one real
> implementation of Graph.

True, but there are several places that call it.

>
> If you don't want to change the Graph API, another option is to
> implement graph.find_unique_lca at the module level.
> Graph.find_unique_lca could then be implemented in terms of
> graph.find_unique_lca.

That doesn't sound a whole lot better. Now we have 2 functions that
might be addressed as "graph.find_unique_lca" only with different
signatures.

Re-traversing the graph is unfortunate, but not terrible. And I'd like
to postpone rewriting find_unique_lca on top of the rest of this.

>
>> +                    self.base_rev_id = self.revision_graph.find_unique_lca(
>> +                                            revisions[0], revisions[1])
>> +                else:
>> +                    self.base_rev_id = self.revision_graph.find_unique_lca(
>> +                                            *lcas)
>> +                self._is_criss_cross = True
>>              if self.base_rev_id == NULL_REVISION:
>>                  raise UnrelatedBranches()
>> -            if steps > 1:
>> +            if self._is_criss_cross:
>>                  warning('Warning: criss-cross merge encountered.  See bzr'
>>                          ' help criss-cross.')
>> -        self.base_tree = self.revision_tree(self.base_rev_id)
>> +                interesting_revision_ids = [self.base_rev_id]
>> +                interesting_revision_ids.extend(lcas)
>> +                interesting_trees = dict((t.get_revision_id(), t)
>> +                    for t in self.this_branch.repository.revision_trees(
>> +                        interesting_revision_ids))
>> +                self._cached_trees.update(interesting_trees)
>> +                self.base_tree = interesting_trees.pop(self.base_rev_id)
>> +                sorted_lca_keys = self.revision_graph.find_merge_order(
>> +                    revisions[0], lcas)
>> +                self._lca_trees = [interesting_trees[key]
>> +                                   for key in sorted_lca_keys]
>
> It seems like we need to implement Merger.revision_trees (or at least a
> private method):
>
> 1. touching _cached_trees in find_base is a layering issue, and
> duplicates code
> 2. failing to use Tree.revision_tree means that we cannot use the
> optimizations in DirStateRevisionTree.
> 3. it means fewer special cases in find_base
>

So the big win is to extract the base tree along with the revision trees
at the same time. I would guess extracting them all at the same time is
a bigger win than DirStateRevisionTree. We don't have iter_changes
anymore anyway. Also this is BASE, LCAs, and OTHER which are *rarely*
going to be the same as WT.basis_tree(). Especially considering we know
we have a criss-cross merge.

I don't believe I actually make use of cached_trees much. I didn't
change this code yet.

>> +            else:
>> +                self.base_tree = self.revision_tree(self.base_rev_id)
>>          self.base_is_ancestor = True
>>          self.base_is_other_ancestor = True
>
>> @@ -550,19 +611,24 @@
>>          return self.tt
>
>>      def _compute_transform(self):
>> -        entries = self._entries3()
>> +        if self._lca_trees is None:
>> +            entries = self._entries3()
>> +            resolver = self._three_way
>> +        else:
>> +            entries = self._entries_lca()
>> +            resolver = self._lca_multi_way
>>          child_pb = ui.ui_factory.nested_progress_bar()
>>          try:
>>              for num, (file_id, changed, parents3, names3,
>>                        executable3) in enumerate(entries):
>
> ^^^ I think these variable names should be changed.

What specific names? the parents3? or the 'entries' or 'resolver'?

>
>>                  child_pb.update('Preparing file merge', num, len(entries))
>> -                self._merge_names(file_id, parents3, names3)
>> +                self._merge_names(file_id, parents3, names3, resolver=resolver)
>>                  if changed:
>>                      file_status = self.merge_contents(file_id)
>>                  else:
>>                      file_status = 'unmodified'
>>                  self._merge_executable(file_id,
>> -                    executable3, file_status)
>> +                    executable3, file_status, resolver=resolver)
>>          finally:
>>              child_pb.finished()
>>          self.fix_root()
>> @@ -614,6 +680,178 @@
>>              result.append((file_id, changed, parents3, names3, executable3))
>>          return result
>
>> +    def _entries_lca(self):
>> +        """Gather data about files modified between multiple trees.
>> +
>> +        This compares OTHER versus all LCA trees, and for interesting entries,
>> +        it then compares with THIS and BASE.
>> +
>> +        For the multi-valued entries, the format will be (BASE, [lca1, lca2])
>
> This seems inaccurate.  The value is actually (BASE, [lca1, lca2]),
> OTHER, THIS
>
> But I don't think it should be.  BASE, LCAS, OTHER, THIS makes a lot
> more sense to me.  I don't see a need to join the BASE with the LCAs in
> a tuple.

So the specific reason to do so is because of code like
"_merge_names(names3)" which unpacks the tuple. It does:
base_name, other_name, this_name = names

And then ignores 'base_name'. So by putting "(base, [lca1, lca2])" into
that single variable, I didn't have to update any other code.

>
>> +        :return: [(file_id, changed, parents, names, executable)]
>> +            file_id     Simple file_id of the entry
>> +            changed     Boolean, True if the kind or contents changed
>> +                        else False
>> +            parents     ((base, [parent_id, in, lcas]), parent_id_other,
>> +                         parent_id_this)
>> +            names       ((base, [name, in, lcas]), name_in_other, name_in_this)
>> +            executable  ((base, [exec, in, lcas]), exec_in_other, exec_in_this)
>> +        """
>> +        if self.interesting_files is not None:
>> +            lookup_trees = [self.this_tree, self.base_tree]
>> +            lookup_trees.extend(self._lca_trees)
>> +            # I think we should include the lca trees as well
>> +            interesting_ids = self.other_tree.paths2ids(self.interesting_files,
>> +                                                        lookup_trees)
>
> Filtering for interesting entries seems like it should be done in
> MultiWalker.

We could, but it seemed reasonable to keep MultiWalker simple and filter
above it. We don't save much CPU filtering inside multi walker (we've
already built up inventory objects, so there isn't a lot to be saved.)

I suppose I did it this way because I built things up in steps, and
filtering was one of the last things I did (as it is only needed for
partial merges, which are fairly rare.)

If you require, I can put it there, but I think it works fine either
way, and I've already written this code.

>
>> +        else:
>> +            interesting_ids = self.interesting_ids
>> +        result = []
>> +        # XXX: Do we want a better sort order than this?
>> +        walker = _mod_tree.MultiWalker(self.other_tree, self._lca_trees)
>
> parent-to-child order will help if the merge introduces a lot of new
> files inside new directories, because it TT will attempt to place files
> inside parents if the parents are already in limbo.

It *does* do parent-to-child order, it just does it one tree at a time.
So it matches all trees parent-to-child based on OTHER, and then walks
LCA1, and LCA2 in order of missing nodes.
The only thing we could do, would be to walk all of OTHER, LCA1, LCA2 in
pure parent-to-child. But we can't strictly get that, because it is
possible for OTHER to rename a directory into its own subdir relative to
LCA1.

I can certainly take out the XXX comment, though, as it doesn't strictly
apply.

>
>> +        base_inventory = self.base_tree.inventory
>> +        this_inventory = self.this_tree.inventory
>> +        for path, file_id, other_ie, lca_values in walker.iter_all():
>> +            # Is this modified at all from any of the other trees?
>> +            if other_ie is None:
>> +                other_ie = _none_entry
>> +            if interesting_ids is not None and file_id not in interesting_ids:
>> +                continue
>> +
>> +            # If other_revision is found in any of the lcas, that means this
>> +            # node is uninteresting. This is because when merging, if there are
>> +            # multiple heads(), we have to create a new node. So if we didn't,
>> +            # we know that the ancestry is linear, and that OTHER did not
>> +            # modify anything
>> +            # See doc/developers/lca_merge_resolution.txt for details
>> +            other_revision = other_ie.revision
>> +            if other_revision is not None:
>> +                # We can't use this shortcut when other_revision is None,
>> +                # because it may be None because things are WorkingTrees, and
>> +                # not because it is *actually* None.
>
> ^^^ When would it *actually* be none unless the trees were workingtrees?
>  It is required to have a value for RevisionTrees.  _none_entry just
> makes this logic more confused.

It is None when the entry is deleted. And it makes the rest of the logic
*a lot* more simple, because I don't have to put if statements all over
the rest of the code when extracting variables. (text_sha1, text_size,
revision, parent_id, executable). Instead we have 1 if seeing if the
entry is available, and the rest just use None, which they would use anyway.

...

>> +            kind_winner = Merge3Merger._lca_multi_way(
>> +                (base_ie.kind, lca_kinds),
>> +                other_ie.kind, this_ie.kind)
>
> Why are we doing this?  That sort of thing should be done in merge_contents.

Because we have to know if "file content changed or kind changed". It is
one of the required return fields.

>
> When invoking methods on your own class, I think it's best to do it in a
> generic way, such that subclasses can override those methods.
>
> So I'd rather see self._lca_multi_way or perhaps
> self.__class__._lca_multi_way.  This is the way existing code is written.

Well, inside static-methods you don't have a 'self', but other than that
I can do so. I think bzrlib itself is a bit fragmented in how it
approaches this. I think Robert has mentioned that class-level functions
should be accessed via the class, to avoid confusing where the 'self'
value is coming from.

I'll switch the ones I can, though.

>
>> @@ -744,14 +1032,16 @@
>>                  parents.append(entry.parent_id)
>>          return self._merge_names(file_id, parents, names)
>
>> -    def _merge_names(self, file_id, parents, names):
>> +    def _merge_names(self, file_id, parents, names, resolver=None):
>
> I think resolver should become a mandatory parameter.  It's a private
> method, after all.

Done. It just required modifying other code (merge_names() versus
_merge_names()) but not a big deal.

>
>> === modified file 'bzrlib/tests/test_merge.py'
>> --- bzrlib/tests/test_merge.py	2008-07-16 16:59:32 +0000
>> +++ bzrlib/tests/test_merge.py	2008-08-01 16:40:01 +0000
>
>> @@ -1089,3 +1091,1635 @@
>>      def test_merge_move_and_change(self):
>>          self.expectFailure("lca merge doesn't conflict for move and change",
>>              super(TestLCAMerge, self).test_merge_move_and_change)
>> +
>> +
>> +class LoggingMerger(object):
>> +    # These seem to be the required attributes
>> +    requires_base = False
>> +    supports_reprocess = False
>> +    supports_show_base = False
>> +    supports_cherrypick = False
>> +    # We intentionally do not define supports_lca_trees
>> +
>> +    def __init__(self, *args, **kwargs):
>> +        self.args = args
>> +        self.kwargs = kwargs
>> +
>> +
>> +class TestMergerBase(TestCaseWithMemoryTransport):
>> +    """Tests for Merger that can be done without hitting disk."""
>
> This description doesn't seem accurate.  The class has no tests.

    """Common functionality for Merger tests that don't write to disk."""

Is that better?

>
>> +
>> +    def get_builder(self):
>> +        builder = self.make_branch_builder('path')
>> +        builder.start_series()
>> +        self.addCleanup(builder.finish_series)
>> +        return builder
>> +
>> +    def setup_simple_graph(self):
>> +        """Create a simple 3-node graph.
>> +                A
>> +                |\\
>> +                B C
>
> Do you mean to have two slashes here?  Or did you mean
>
>                 A
>                 |\
>                 B C

It is a python string, so a single '\' is escaping the next newline
character. I'll turn it into a comment where it won't cause problems.

>
>> +
>> +        :return: A BranchBuilder
>> +        """
>> +        builder = self.get_builder()
>> +        builder.build_snapshot('A-id', None,
>> +            [('add', ('', None, 'directory', None))])
>> +        builder.build_snapshot('C-id', ['A-id'], [])
>> +        builder.build_snapshot('B-id', ['A-id'], [])
>> +        return builder
>> +
>> +    def setup_criss_cross_graph(self):
>> +        """Create a 5-node graph with a criss-cross.
>> +                A
>> +                |\\
>> +                B C
>> +                |X|
>> +                D E
>
> Double-slash again...

Same reason.

>
>
>> +        :return: A BranchBuilder
>> +        """
>> +        builder = self.setup_simple_graph()
>> +        builder.build_snapshot('E-id', ['C-id', 'B-id'], [])
>> +        builder.build_snapshot('D-id', ['B-id', 'C-id'], [])
>> +        return builder
>> +
>> +    def make_Merger(self, builder, other_revision_id,
>> +                    interesting_files=None, interesting_ids=None):
>> +        """Make a Merger object from a branch builder"""
>> +        mem_tree = memorytree.MemoryTree.create_on_branch(builder.get_branch())
>> +        mem_tree.lock_write()
>> +        self.addCleanup(mem_tree.unlock)
>> +        merger = _mod_merge.Merger.from_revision_ids(progress.DummyProgress(),
>> +            mem_tree, other_revision_id)
>> +        if interesting_files is not None:
>> +            merger.set_interesting_files(interesting_files)
>
> You don't need this conditional; set_interesting_files(None) will have
> no effect.
>
>> +        if interesting_ids is not None:
>> +            # It seems there is no matching function for set_interesting_ids
>> +            merger.interesting_ids = interesting_ids
>
> Again, no conditional.  (Not sure why interesting_ids isn't a function,
> though)

Done.

>
>> +        merger.merge_type = _mod_merge.Merge3Merger
>> +        return merger
>> +
>> +
>> +class TestMergerInMemory(TestMergerBase):
>> +
>> +    def test_find_base(self):
>> +        merger = self.make_Merger(self.setup_simple_graph(), 'C-id')
>> +        self.assertEqual('A-id', merger.base_rev_id)
>> +        self.assertFalse(merger._is_criss_cross)
>> +        self.assertIs(None, merger._lca_trees)
>> +
>> +    def test_find_base_criss_cross(self):
>> +        builder = self.setup_criss_cross_graph()
>> +        merger = self.make_Merger(builder, 'E-id')
>> +        self.assertEqual('A-id', merger.base_rev_id)
>> +        self.assertTrue(merger._is_criss_cross)
>> +        self.assertEqual(['B-id', 'C-id'], [t.get_revision_id()
>> +                                            for t in merger._lca_trees])
>> +        # If we swap the order, we should get a different lca order
>
> Is the LCA order significant?  It seems like it shouldn't be.

Mostly it is because ordering that means that all the tests can check
that each LCA's value was properly copied across, without having to
resort to strange sorting tricks later on.

It doesn't change the *final* outcomes (modified, not modified, etc.) It
just makes the tests easier to write.

...

>> +
>> +    def test_criss_cross_not_supported_merge_type(self):
>> +        class NoLCATreesMerger(LoggingMerger):
>> +            # We intentionally do not define supports_lca_trees
>> +            pass
>
> Can't you just use LoggingMerger here instead of defining a new class?

I felt the test was clearer by using an explicitly named class and a
comment. If you prefer, I can just use LoggingMerger and a comment.

...

>> +    def test_simple(self):
>> +        builder = self.get_builder()
>> +        builder.build_snapshot('A-id', None,
>> +            [('add', (u'', 'a-root-id', 'directory', None)),
>> +             ('add', (u'a', 'a-id', 'file', 'a\nb\nc\n'))])
>> +        builder.build_snapshot('C-id', ['A-id'],
>> +            [('modify', ('a-id', 'a\nb\nC\nc\n'))])
>> +        builder.build_snapshot('B-id', ['A-id'],
>> +            [('modify', ('a-id', 'a\nB\nb\nc\n'))])
>> +        builder.build_snapshot('E-id', ['C-id', 'B-id'],
>> +            [('modify', ('a-id', 'a\nB\nb\nC\nc\nE\n'))])
>> +        builder.build_snapshot('D-id', ['B-id', 'C-id'],
>> +            [('modify', ('a-id', 'a\nB\nb\nC\nc\n'))])
>> +        merge_obj = self.make_merge_obj(builder, 'E-id')
>
> It seems like you could profitably make some helper functions to
> construct these snapshots.  Do you nee any more data than
>
>   new_snapshot('modify', 'B-id', ['A-id'])

This specific test does not. But the exact content of the file is often
important.

>
> ?
>
> Also, it's weird to build D-id after E-id.  I realize you're doing it
> because this causes D-id to be selected by make_merge_object, but could
> you swap the names?

I fairly carefully followed the same graph
  A
  |\
  B C
  |X|
  D E

And didn't re-order them. I can build it in any order and then tell the
builder to use D before I do the merge. I mostly did it this way because
it is faster. By building in this particular order, the builder gets to
re-use the left-hand parent most of the time, which saves time of
regenerating the inventory from the repository, etc.

The full set of tests run in only a few seconds, which was very useful
for me when writing the code.

So... I *can* change the names, but I would like to keep the graph
similar in most cases, and I'd like to keep the tests fast to run. If
you require, I can change it.

>> +    def test_not_in_base(self):
>> +        # LCA's all have the same last-modified revision for the file, as do
>
> Plural should be "LCAs"
>
>> +        # the tips, but the base has something different
>> +        #       A    base, doesn't have the file
>> +        #       |\
>> +        #       B C  B introduces 'foo', C introduces 'bar'
>> +        #       |X|
>> +        #       D E  D and E now both have 'foo' and 'bar'
>> +        #       |X|
>> +        #       F G  the files are now in F, G, D and E, but not in A
>> +        #            G modifies 'bar'
>> +
>> +        builder = self.get_builder()
>> +        builder.build_snapshot('A-id', None,
>> +            [('add', (u'', 'a-root-id', 'directory', None))])
>> +        builder.build_snapshot('B-id', ['A-id'],
>> +            [('add', (u'foo', 'foo-id', 'file', 'a\nb\nc\n'))])
>> +        builder.build_snapshot('C-id', ['A-id'],
>> +            [('add', (u'bar', 'bar-id', 'file', 'd\ne\nf\n'))])
>> +        builder.build_snapshot('D-id', ['B-id', 'C-id'],
>> +            [('add', (u'bar', 'bar-id', 'file', 'd\ne\nf\n'))])
>
> Why is this an add, not modify?  The file exists in C-id, and D-id is
> derived from it.

Because when building a snapshot it is based strictly on its left-hand
parent. We don't actually run any "merge" code. So if you have:

A
|\
B C  # C introduces a file
|/
D    # D appears to introduce the file relative to B

So we have to 'add' it both times. In fact, if you look, D does not
modify it relative to C. The limitation is mostly that it makes the
Builder code simpler, partially because most of the merge code doesn't
work on MemoryTree.

>
>> +    def test_file_not_in_one_lca(self):
>> +        builder = self.get_builder()
>> +        builder.build_snapshot('A-id', None,
>> +            [('add', (u'', 'a-root-id', 'directory', None))])
>> +        builder.build_snapshot('B-id', ['A-id'], [])
>> +        builder.build_snapshot('C-id', ['A-id'],
>> +            [('add', (u'a', 'a-id', 'file', 'a\nb\nc\n'))])
>> +        builder.build_snapshot('E-id', ['C-id', 'B-id'], []) # Inherited from C
>> +        builder.build_snapshot('D-id', ['B-id', 'C-id'], # Merged from C
>> +            [('add', (u'a', 'a-id', 'file', 'a\nb\nc\n'))])
>> +        merge_obj = self.make_merge_obj(builder, 'E-id')
>> +
>> +        self.assertEqual(['B-id', 'C-id'], [t.get_revision_id()
>> +                                            for t in merge_obj._lca_trees])
>> +        self.assertEqual('A-id', merge_obj.base_tree.get_revision_id())
>> +
>> +        entries = list(merge_obj._entries_lca())
>> +        root_id = 'a-root-id'
>> +        self.assertEqual([], entries)
>
> I'm pretty sure this counts as a changed entry.  Why are we not emitting
> something?

Because nothing is changed relative to a common base. D resolved adding
the file the same way that E resolved adding the file. If you ignore B
for a second, you can see that for "CDE" there is no change.

        #   A   # just root
        #   |\
        #   B C # B no file, C introduces a file
        #   |X|
        #   D E # D and E both have the file, unchanged from C

^- I added this as a comment. I feel like there is no change to be
emitted. What change would you like to see?
This is, honestly, one of the key reasons for this code, because it can
detect that this is 'uninteresting'. One LCA has the same value as BASE,
and the other LCA supersedes it. The new tips have the same value as the
superseding LCA.

>
>> +    def test_one_lca_supersedes(self):
>> +        # One LCA supersede's the other LCA's last modified value, but the
>
> No apostrophe in "supersedes".

Done

>
>> +        # value is not the same as BASE.
>> +        #       A    base, introduces 'foo', last mod A
>> +        #       |\
>> +        #       B C  B modifies 'foo' (mod B), C does nothing (mod A)
>> +        #       |X|
>> +        #       D E  D does nothing (mod B), E updates 'foo' (mod E)
>> +        #       |X|
>> +        #       F G  F updates 'foo' (mod F). G does nothing (mod E)
>> +        #
>> +        #   At this point, G should not be considered to modify 'foo', even
>> +        #   though its LCAs disagree. This is because the modification in E
>> +        #   completely supersedes the value in D.
>> +        builder = self.get_builder()
>> +        builder.build_snapshot('A-id', None,
>> +            [('add', (u'', 'a-root-id', 'directory', None)),
>> +             ('add', (u'foo', 'foo-id', 'file', 'A content\n'))])
>> +        builder.build_snapshot('C-id', ['A-id'], [])
>> +        builder.build_snapshot('B-id', ['A-id'],
>> +            [('modify', ('foo-id', 'B content\n'))])
>> +        builder.build_snapshot('D-id', ['B-id', 'C-id'], [])
>> +        builder.build_snapshot('E-id', ['C-id', 'B-id'],
>> +            [('modify', ('foo-id', 'E content\n'))])
>> +        builder.build_snapshot('G-id', ['E-id', 'D-id'], [])
>> +        builder.build_snapshot('F-id', ['D-id', 'E-id'],
>> +            [('modify', ('foo-id', 'F content\n'))])
>> +        merge_obj = self.make_merge_obj(builder, 'G-id')
>> +
>> +        self.assertEqual([], list(merge_obj._entries_lca()))
>
> This is mind-melting.  Can we please simplify _entries_lca so that it
> emits *everything* except entries that are the same in THIS and OTHER?
> Trying to keep track of when it defers merge resolution to merge_names
> vs resolving the merge itself is not easy.

Well, because it does it correctly, and keeps us from having to deal
with entries that are not genuinely modified. I realize it is complex.
Most of that is real-world complexity derived from examples in the mysql
history. I believe this is one of the explicit points that they ran
into. Their merge case does indeed have a double-criss-cross that causes
things to look modified when they shouldn't. To get the merge correct,
we have to treat those paths and content as unmodified. (Because it
genuinely hasn't given the per-file graph.)

In order to generate the right values for the various fields (content
changed, filename changed, etc) I've done 90% of the work of figure out
whether or not we need to do any more work. Why not just omit the ones
that I know don't need to be processed further?

>
>> === added file 'doc/developers/lca_merge_resolution.txt'
>> --- doc/developers/lca_merge_resolution.txt	1970-01-01 00:00:00 +0000
>> +++ doc/developers/lca_merge_resolution.txt	2008-07-31 19:12:59 +0000
>> @@ -0,0 +1,127 @@
>> +====================
>> +LCA Merge Resolution
>> +====================
>
> Probably better to call it "LCA Tree merging" or "LCA scalar merging",
> just to reduce confusion further

I went with "LCA Tree Merging".

>
>> +
>> +There are 2 ways that you get LCA merge resolution in bzr. First, if you use
>> +``bzr merge --lca``, the content of files will be resolved using a Least Common
>> +Ancestors algorithm. That is not being described here.
>
> Perhaps a link to the spec for LCA text merging?

Done.

>
>> +This is describing how we handle merging tree-shape when there is not a single
>
> Perhaps "This describes"?

This document describes

...

>> +
>> +2. Find the values from ``LCA1`` and ``LCA2`` which are not the same as
>> +   ``BASE``. The idea here is to provide a rudimentary "heads" comparison.
>> +   Often, the whole tree graph will have a criss-cross, but the per-file
>> +   (per-scalar) graph would be linear. And the value in one LCA strictly
>> +   dominates the other. It is possible to construct a scenario where one side
>> +   dominates the other, but the dominated value is not ``BASE``, but a second
>> +   intermediate value. Most scalars are rarely changed, so this is unlikely to
>> +   be an issue. And the trade-off is having to generate and inspect the
>> +   per-scalar graph.
>
> Try to avoid starting sentences with "And".

Cleaned up a few.

>
>> +   If there are no LCA values that are different from ``BASE``, we use a simple
>> +   3-way merge with ``BASE`` as the base value.
>> +
>> +3. Find the unique set of LCA values that do not include the ``BASE`` value.
>> +   If there is only one unique LCA value, we again use three-way merge logic
>> +   using that unique value as the base.
>> +
>> +4. At this point we have determined that we have at least 2 unique values in
>
> Comma after point
>
>> +   our LCAs which means that ``THIS`` and ``OTHER`` would both have to resolve
>> +   the conflict. If they resolved it in the same way, we would have caught that
>> +   in step 1. So they either both picked a different LCA value, or one (or
>> +   both) chose a new value to use.
>> +
>> +   So at this point, if ``OTHER`` and ``THIS`` both picked a different LCA
>> +   value, we conflict.
>> +
>> +   If ``OTHER`` and ``THIS`` both have values that are not LCA values, we also
>> +   conflict. (Same as 3-way, both sides modified a value in different ways.)
>> +
>> +5. (optional) The only tricky part is this, if ``OTHER`` has a LCA value, but
>
> Comma after "this" should be a colon.
>
>> +   ``THIS`` does not, then we go with ``THIS``, and conversely if ``THIS`` has
>> +   an LCA value, but ``OTHER`` does not, then we go with ``OTHER``. The idea is
>> +   that ``THIS`` and ``OTHER`` may have resolved things in the same way, and
>> +   then later changed the value to something newer. (They could have also
>> +   resolved it differently, and then one side updated again.)
>
> Anyhow, I think this describes a reasonable technique.  I had always
> assumed that we would need to change our inventory representation in
> order to do any sort of history-aware inventory merging.  My hat is off
> to you for figuring out how to squeeze it out of the information we
> already have.

I'm attaching a version which responds to most of your requests. The
remaining ones are:

1) Modifying Graph.find_unique_lca. I feel like this is outside the
scope of this patch. It certainly isn't trivial to fix.

2) Merger.revision_trees(). I *can* implement this, though there is only
one caller.

3) Filtering in MultiWalker. We could, but I don't think it would be a
net win, and this code is already written. If you feel strongly, I can
do so.

4) Changing _entries_lca to return everything where THIS != OTHER. The
point of what I wrote is to only return entries where OTHER != "base"
(where base is also made up of lcas as appropriate.)

I appreciate you reviewing this. I know it gets hairy, there are some
outlandish edge-cases that happen to exist in the real world. I don't
feel like I've really over engineered it, but I'm willing to discuss it
a bit more.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjApO4ACgkQJdeBCYSNAANoUQCgzE17O4cW/W4ZgSf5GQ2Jfp7d
AXYAnA9Mxgj2GwPsrCdRTyVy2ycPIVEn
=7B1X
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: merge_lca_multi2.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20080904/93afffb1/attachment-0001.diff