[MERGE] LCA Merge resolution for tree-shape

Mon Sep 8 16:55:45 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Aaron Bentley wrote:
>> The tests are, if anything, harder to read than the implementation.
> 
> I've tried to add a bit more comments to the tests, but honestly, the
> tests are complex because they are testing genuine edge cases. Which can
> easily correspond to broken code.

In many of the tests, most of your parameters to builder.build_snapshot
are predictable or repeated.  It would help greatly if you had helper
functions so that you only had to supply parameters for the things you
were varying.

>> I was actually hoping that in the future, rename functionality would be
>> implemented in terms of apply_inventory_delta.
> 
> I seem to recall trying apply_inventory_delta first, but it actually
> meant implementing a lot more of the api, because
> MT.apply_inventory_delta wants to do stuff like "_write_inventory()"
> which doesn't make sense for a MemoryTree.
> Also, at some point you have to move the things on disk, which is a MT
> function, not an inventory one.

Yes, there is that half.  Maybe we could use a Transport, but never
mind.  We need to turn a lot of the WorkingTree tests into MutableTree
tests, so that MT is properly tested.  There's no reason it shouldn't
support apply_inventory_delta, for example.  And PreviewTree should
support it fairly trivially, too.

>>>          self._revision_graph = revision_graph
>>>          self._base_is_ancestor = None
>>>          self._base_is_other_ancestor = None
>>> +        self._is_criss_cross = False
>> ^^^ None is a better default than False when no value has been
>> determined.  It also matches the surrounding code better.
> 
> 
> I suppose. There is only one place that can set it to True, and several
> paths that should set it to False. I went ahead and did it, and then
> just default it to False in 'find_base'.

find_base is only called if the base revision is not supplied.  So code
needs to be able to tell if a criss_cross check has been done.

>>> +            else: # len(lcas) > 1
>> ^^^ What is this commented-out code for?
> 
> Just documenting the case which is being handled in the 'else'
> statement. It isn't commented-out code. It is a code comment. I can
> remove it. Probably it evolved from when the sections were a bit longer,
> and it may have been a bit more unclear about what was going on.

Okay.  I guess I'm fine with it then.  Maybe "i.e. len(lcas) > 1" would
be clearer.

>> If you don't want to change the Graph API, another option is to
>> implement graph.find_unique_lca at the module level.
>> Graph.find_unique_lca could then be implemented in terms of
>> graph.find_unique_lca.
> 
> That doesn't sound a whole lot better. Now we have 2 functions that
> might be addressed as "graph.find_unique_lca" only with different
> signatures.

I do think that's better than having to work around a deficiency in the
API.  If it's just a naming issue, I'm sure we can find a different name
to use.

It seems to me like a form that iterated LCAs might be ideal.  Then you
could do:

iterator = graph.search_lcas()
values = iterator.next()
if len(values) == 1:
  criss_cross = False
  lcas = []
else:
  criss_cross = True
  lcas = values
  for values in iterator:
    pass
(base,) = values

> Re-traversing the graph is unfortunate, but not terrible. And I'd like
> to postpone rewriting find_unique_lca on top of the rest of this.

You always have the option of rewriting find_unique_lca as a separate
patch that we merge *first*.

>> It seems like we need to implement Merger.revision_trees (or at least a
>> private method):
> 
>> 1. touching _cached_trees in find_base is a layering issue, and
>> duplicates code
>> 2. failing to use Tree.revision_tree means that we cannot use the
>> optimizations in DirStateRevisionTree.
>> 3. it means fewer special cases in find_base
> 
> 
> So the big win is to extract the base tree along with the revision trees
> at the same time. I would guess extracting them all at the same time is
> a bigger win than DirStateRevisionTree.

I don't think the two are in conflict.  For the LCA case, you usually
won't get any DirStateRevisionTrees, so you'll extract them all at once.
 For other merge cases, you'll get DirStateRevisionTrees, and that will
be a win.

In any case, you haven't provided an argument against 1 or 3.

>>>              for num, (file_id, changed, parents3, names3,
>>>                        executable3) in enumerate(entries):
>> ^^^ I think these variable names should be changed.
> 
> What specific names? the parents3? or the 'entries' or 'resolver'?

parents3, names3, executable3.

>>> @@ -614,6 +680,178 @@
>>>              result.append((file_id, changed, parents3, names3, executable3))
>>>          return result
>>> +    def _entries_lca(self):
>>> +        """Gather data about files modified between multiple trees.
>>> +
>>> +        This compares OTHER versus all LCA trees, and for interesting entries,
>>> +        it then compares with THIS and BASE.
>>> +
>>> +        For the multi-valued entries, the format will be (BASE, [lca1, lca2])
>> This seems inaccurate.  The value is actually (BASE, [lca1, lca2]),
>> OTHER, THIS
> 
>> But I don't think it should be.  BASE, LCAS, OTHER, THIS makes a lot
>> more sense to me.  I don't see a need to join the BASE with the LCAs in
>> a tuple.
> 
> So the specific reason to do so is because of code like
> "_merge_names(names3)" which unpacks the tuple. It does:
> base_name, other_name, this_name = names
> 
> And then ignores 'base_name'. So by putting "(base, [lca1, lca2])" into
> that single variable, I didn't have to update any other code.

Yes, and you wrote more confusing code in the process.  This is a
workaround.  Workarounds lead to a messy codebase.  Sometimes, they are
necessary to preserve API compatibility, but this is not once of those
times.

I'm not convinced that it makes sense to retain the iter_entries /
_merge_names pairing, now that we're using LCA merge resolution.  If you
think that division still makes sense, we should update it to handle
current conditions.

>>> +        :return: [(file_id, changed, parents, names, executable)]
>>> +            file_id     Simple file_id of the entry
>>> +            changed     Boolean, True if the kind or contents changed
>>> +                        else False
>>> +            parents     ((base, [parent_id, in, lcas]), parent_id_other,
>>> +                         parent_id_this)
>>> +            names       ((base, [name, in, lcas]), name_in_other, name_in_this)
>>> +            executable  ((base, [exec, in, lcas]), exec_in_other, exec_in_this)
>>> +        """
>>> +        if self.interesting_files is not None:
>>> +            lookup_trees = [self.this_tree, self.base_tree]
>>> +            lookup_trees.extend(self._lca_trees)
>>> +            # I think we should include the lca trees as well
>>> +            interesting_ids = self.other_tree.paths2ids(self.interesting_files,
>>> +                                                        lookup_trees)
>> Filtering for interesting entries seems like it should be done in
>> MultiWalker.
> 
> We could, but it seemed reasonable to keep MultiWalker simple and filter
> above it. We don't save much CPU filtering inside multi walker (we've
> already built up inventory objects, so there isn't a lot to be saved.)

When we reuse MultiWalker, are we likely to want to reuse the filtering?
 If so, this code belongs somewhere where we can reuse it, e.g. on
MultiWalker.

>>> +        else:
>>> +            interesting_ids = self.interesting_ids
>>> +        result = []
>>> +        # XXX: Do we want a better sort order than this?
>>> +        walker = _mod_tree.MultiWalker(self.other_tree, self._lca_trees)
>> parent-to-child order will help if the merge introduces a lot of new
>> files inside new directories, because it TT will attempt to place files
>> inside parents if the parents are already in limbo.
> 
> It *does* do parent-to-child order

Okay.  I was just responding to your comment, with some advice I hoped
would be useful.

>>> +        base_inventory = self.base_tree.inventory
>>> +        this_inventory = self.this_tree.inventory
>>> +        for path, file_id, other_ie, lca_values in walker.iter_all():
>>> +            # Is this modified at all from any of the other trees?
>>> +            if other_ie is None:
>>> +                other_ie = _none_entry
>>> +            if interesting_ids is not None and file_id not in interesting_ids:
>>> +                continue
>>> +
>>> +            # If other_revision is found in any of the lcas, that means this
>>> +            # node is uninteresting. This is because when merging, if there are
>>> +            # multiple heads(), we have to create a new node. So if we didn't,
>>> +            # we know that the ancestry is linear, and that OTHER did not
>>> +            # modify anything
>>> +            # See doc/developers/lca_merge_resolution.txt for details
>>> +            other_revision = other_ie.revision
>>> +            if other_revision is not None:
>>> +                # We can't use this shortcut when other_revision is None,
>>> +                # because it may be None because things are WorkingTrees, and
>>> +                # not because it is *actually* None.
>> ^^^ When would it *actually* be none unless the trees were workingtrees?
>>  It is required to have a value for RevisionTrees.  _none_entry just
>> makes this logic more confused.
> 
> It is None when the entry is deleted. And it makes the rest of the logic
> *a lot* more simple, because I don't have to put if statements all over
> the rest of the code when extracting variables.

So perhaps you should reassign other_ie *after* this code?

>>> +            kind_winner = Merge3Merger._lca_multi_way(
>>> +                (base_ie.kind, lca_kinds),
>>> +                other_ie.kind, this_ie.kind)
>> Why are we doing this?  That sort of thing should be done in merge_contents.
> 
> Because we have to know if "file content changed or kind changed". It is
> one of the required return fields.

We should know this just by comparing THIS and OTHER.  But we can also
just change the contract so changed=True means "may have changed".

>> When invoking methods on your own class, I think it's best to do it in a
>> generic way, such that subclasses can override those methods.
> 
>> So I'd rather see self._lca_multi_way or perhaps
>> self.__class__._lca_multi_way.  This is the way existing code is written.
> 
> Well, inside static-methods you don't have a 'self'

What you can do is change the staticmethod into a classmethod.  Then you
can call klass._lca_multi_way.

> but other than that
> I can do so. I think bzrlib itself is a bit fragmented in how it
> approaches this. I think Robert has mentioned that class-level functions
> should be accessed via the class, to avoid confusing where the 'self'
> value is coming from.

I think that defeats inheritance, so it is bad.

>>> === modified file 'bzrlib/tests/test_merge.py'
>>> --- bzrlib/tests/test_merge.py	2008-07-16 16:59:32 +0000
>>> +++ bzrlib/tests/test_merge.py	2008-08-01 16:40:01 +0000
>>> @@ -1089,3 +1091,1635 @@
>>>      def test_merge_move_and_change(self):
>>>          self.expectFailure("lca merge doesn't conflict for move and change",
>>>              super(TestLCAMerge, self).test_merge_move_and_change)
>>> +
>>> +
>>> +class LoggingMerger(object):
>>> +    # These seem to be the required attributes
>>> +    requires_base = False
>>> +    supports_reprocess = False
>>> +    supports_show_base = False
>>> +    supports_cherrypick = False
>>> +    # We intentionally do not define supports_lca_trees
>>> +
>>> +    def __init__(self, *args, **kwargs):
>>> +        self.args = args
>>> +        self.kwargs = kwargs
>>> +
>>> +
>>> +class TestMergerBase(TestCaseWithMemoryTransport):
>>> +    """Tests for Merger that can be done without hitting disk."""
>> This description doesn't seem accurate.  The class has no tests.
> 
>     """Common functionality for Merger tests that don't write to disk."""
> 
> Is that better?

Yes, thank you.

>> Do you mean to have two slashes here?  Or did you mean
> 
>>                 A
>>                 |\
>>                 B C
> 
> It is a python string, so a single '\' is escaping the next newline
> character. I'll turn it into a comment where it won't cause problems.

Doh!  Yes, turning it into a comment seems like the best option.

>>> +        # If we swap the order, we should get a different lca order
>> Is the LCA order significant?  It seems like it shouldn't be.
> 
> Mostly it is because ordering that means that all the tests can check
> that each LCA's value was properly copied across, without having to
> resort to strange sorting tricks later on.
> 
> It doesn't change the *final* outcomes (modified, not modified, etc.) It
> just makes the tests easier to write.

When order isn't significant, I usually compare sets.  But we can leave
it like this.

>> Can't you just use LoggingMerger here instead of defining a new class?
> 
> I felt the test was clearer by using an explicitly named class and a
> comment. If you prefer, I can just use LoggingMerger and a comment.

Yes, I'd prefer that.

>> Also, it's weird to build D-id after E-id.  I realize you're doing it
>> because this causes D-id to be selected by make_merge_object, but could
>> you swap the names?
> 
> I fairly carefully followed the same graph
>   A
>   |\
>   B C
>   |X|
>   D E
> 
> And didn't re-order them. I can build it in any order and then tell the
> builder to use D before I do the merge. I mostly did it this way because
> it is faster. By building in this particular order, the builder gets to
> re-use the left-hand parent most of the time, which saves time of
> regenerating the inventory from the repository, etc.

Well, however you want to change it, I think it's counter-intuitive to
do D after E.

>> Why is this an add, not modify?  The file exists in C-id, and D-id is
>> derived from it.
> 
> Because when building a snapshot it is based strictly on its left-hand
> parent. We don't actually run any "merge" code. So if you have:
> 
> A
> |\
> B C  # C introduces a file
> |/
> D    # D appears to introduce the file relative to B
> 
> So we have to 'add' it both times. In fact, if you look, D does not
> modify it relative to C. The limitation is mostly that it makes the
> Builder code simpler, partially because most of the merge code doesn't
> work on MemoryTree.

Merge support for PreviewTree is close, but not quite there.  I hope to
get that into 1.8.

>>> +    def test_file_not_in_one_lca(self):
>>> +        builder = self.get_builder()
>>> +        builder.build_snapshot('A-id', None,
>>> +            [('add', (u'', 'a-root-id', 'directory', None))])
>>> +        builder.build_snapshot('B-id', ['A-id'], [])
>>> +        builder.build_snapshot('C-id', ['A-id'],
>>> +            [('add', (u'a', 'a-id', 'file', 'a\nb\nc\n'))])
>>> +        builder.build_snapshot('E-id', ['C-id', 'B-id'], []) # Inherited from C
>>> +        builder.build_snapshot('D-id', ['B-id', 'C-id'], # Merged from C
>>> +            [('add', (u'a', 'a-id', 'file', 'a\nb\nc\n'))])
>>> +        merge_obj = self.make_merge_obj(builder, 'E-id')
>>> +
>>> +        self.assertEqual(['B-id', 'C-id'], [t.get_revision_id()
>>> +                                            for t in merge_obj._lca_trees])
>>> +        self.assertEqual('A-id', merge_obj.base_tree.get_revision_id())
>>> +
>>> +        entries = list(merge_obj._entries_lca())
>>> +        root_id = 'a-root-id'
>>> +        self.assertEqual([], entries)
>> I'm pretty sure this counts as a changed entry.  Why are we not emitting
>> something?
> 
> Because nothing is changed relative to a common base. D resolved adding
> the file the same way that E resolved adding the file. If you ignore B
> for a second, you can see that for "CDE" there is no change.
> 
>         #   A   # just root
>         #   |\
>         #   B C # B no file, C introduces a file
>         #   |X|
>         #   D E # D and E both have the file, unchanged from C
> 
> ^- I added this as a comment. I feel like there is no change to be
> emitted. What change would you like to see?
> This is, honestly, one of the key reasons for this code, because it can
> detect that this is 'uninteresting'. One LCA has the same value as BASE,
> and the other LCA supersedes it. The new tips have the same value as the
> superseding LCA.

Well, if we can check it simply, fine.  If we wind up duplicating code,
then I'd prefer not to do the filtering.

>> This is mind-melting.  Can we please simplify _entries_lca so that it
>> emits *everything* except entries that are the same in THIS and OTHER?
>> Trying to keep track of when it defers merge resolution to merge_names
>> vs resolving the merge itself is not easy.
> 
> Well, because it does it correctly, and keeps us from having to deal
> with entries that are not genuinely modified. I realize it is complex.
> Most of that is real-world complexity derived from examples in the mysql
> history. I believe this is one of the explicit points that they ran
> into. Their merge case does indeed have a double-criss-cross that causes
> things to look modified when they shouldn't. To get the merge correct,
> we have to treat those paths and content as unmodified. (Because it
> genuinely hasn't given the per-file graph.)

Certainly I'm not trying to say we shouldn't address such cases.  I'm
trying to say we shouldn't address such cases in _entries_lca.

> In order to generate the right values for the various fields (content
> changed, filename changed, etc) I've done 90% of the work of figure out
> whether or not we need to do any more work. Why not just omit the ones
> that I know don't need to be processed further?

This change puts a big dent in our separation of concerns.  I don't have
a big investment in any particular separation of concerns, I just want
it to be clear what they are.

> I'm attaching a version which responds to most of your requests. The
> remaining ones are:
> 
> 1) Modifying Graph.find_unique_lca. I feel like this is outside the
> scope of this patch. It certainly isn't trivial to fix.
> 
> 2) Merger.revision_trees(). I *can* implement this, though there is only
> one caller.
> 
> 3) Filtering in MultiWalker. We could, but I don't think it would be a
> net win, and this code is already written. If you feel strongly, I can
> do so.
> 
> 4) Changing _entries_lca to return everything where THIS != OTHER. The
> point of what I wrote is to only return entries where OTHER != "base"
> (where base is also made up of lcas as appropriate.)

1, 2, and 4 are the only substantial concerns I had, so not addressing
them gets you another

bb:resubmit

> I appreciate you reviewing this. I know it gets hairy, there are some
> outlandish edge-cases that happen to exist in the real world. I don't
> feel like I've really over engineered it, but I'm willing to discuss it
> a bit more.

I don't think you've over-engineered it.  I think you've made
adjustments to the model without adjusting the code to match, making the
codebase messier.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIxUsA0F+nu1YWqI0RAoFBAKCHp3ZF8F2WKttCb3KuRyFu8tTpgwCdEzlI
UYYG6JoTjaIm2Qt80QP187Q=
=WHki
-----END PGP SIGNATURE-----