[MERGE] more common-ancestor performance improvements.

Mon Jun 18 19:10:48 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> John Arbash Meinel has voted +1 (conditional).
> Status is now: Conditionally approved
> Comment:
> I think I was expecting Robert to re-review, since he seemed the most
> capable for reviewing the "graph" code. I'll do what I can, though. I
> don't feel like I have a great grasp of the algorithm you are using (for
> example, how can you have something that is a border ancestor but is a
> descendant of another border ancestor. I would at least think it would
> make it a non-border ancestor....)

A border ancestor is a common ancestor, and it has at least one
descendant that is not a common ancestor.

Z
|
A
|\
B C
|\|
D E

In this graph, A and B are border ancestors, because A has C as a
descendant and B has E as a descendant.  Z not a border, because it has
no un-common descendants.  C, D and E are not borders because they are
not common.

So B is a border ancestor but is a descendant of another border ancestor.

> It might be good to have the algorithm clearly documented, with 'dot'
> examples of the different edge cases we are handling and why. Not that I
> would make that a strict requirement for merging, but it would make it a
> bit easier to understand why the code is doing what it does.

Hmm, you mean in docs/developers?  Wouldn't it be better as ascii art in
graph.py?  It just seems like it would be a lot of effort to generate
the diagrams when you were browsing code.

> +    def get_parents(self, revision_ids):
> +        """
> +        Find revision ids of the parents of a list of revisions
> +
> +        A list is returned of the same length as the input.  Each entry
> 
> 
> I'm pretty sure we put the opening comment on the same line as """. So
> it should look like:

Okay.  Somehow I'd gotten the impression that only applied to
single-line docstrings.

> I may be misunderstanding this loop, but it looks broken to me.

You're right.  Thanks.

> Also, as near as I can tell, this loop has no way of telling you what
> parents a "provider" has access to, so it has to return something for
> every revision that you request of it.

That's correct.  It returns None if it doesn't have access to the
parents.  I couldn't think of a better way to request data about a
series of revisions and fail softly if some were missing.

Is that a problem?

> Ultimately, this loop looks like if you ever had to make more than 1
> pass, it would break in very bad ways.

Strangely enough, I am testing it with more than one pass, in
test_common_ancestor_two_repos.  Obviously, I got "lucky".

> You do allow 'get_parents()' to
> return None, so probably you just need to keep the filtered list.
> 
> +        return [found.get(r) for r in revision_ids]
> 
> I really prefer the explicit "found.get(r, None)" I realize that
> dict.get() defaults to returning None, but I don't find that directly
> obvious, and I think some other people are the same way. (Especially
> since getattr works differently, so it isn't a consistent python thing)

To me, the whole point of dict.get is that it returns None instead of
raising KeyError.  And getattr is broken for not following that convention.

> I would like to make a small comment about another possible
> implementation. Rather than keeping a bunch of sets() of nodes, we could
> just create a specific node object for every revision. Which would have
> attributes for whether it had been seen by the different branches.

I'm not sure what you mean by "branches" here.  You mean the
BreadthFirstSearchers?  Then we'd need some concept of Searcher
identity, I think.  Using object identity could interfere with garbage
collection.

> I
> would guess it has better performance, because it doesn't have to look
> things up as another set/dictionary.

I'm not sure whether this would be a win or not.  Because the dictionary
would be larger that way.

> Instead it is just a member on an
> object that you needed anyway. Just a thought.

> _find_border_ancestors needs a doc string to document its return value.
> It sounds like you are only finding borders, but you are also finding
> 'common' and 2 sets based on what revisions you have to traverse
> (searchers.seen) on each side.

The number of searchers.seen will actually depend on the number of
inputs to find_border_ancestors.

> +    def __repr__(self):
> +        return '_BreadthFirstSearcher(self._search_revisions=%r,' \
> +            ' self.seen=%r)' % (self._search_revisions, self.seen)
> 
> I generally prefer the "parenthesis" form to the '\' form. So you could
> write this as:
> 
> return ('_BreadthFirstSearcher(self._search_revisions=%r,'
>         ' self.seen=%r)') % (self._search_revisions, self.seen)

Hmm, that does look neater.

> +    def stop_searching_any(self, revisions):
> +        """
> +        Remove any of the specified revisions from the search list.
> +
> +        None of the specified revisions are required to be present in the
> +        search list.  In this case, the call is a no-op.
> +        """
> +        stopped_searches = set(l for l in self._search_revisions
> +                               if l in revisions)
> +        self._search_revisions = set(l for l in self._search_revisions
> +                                     if l not in revisions)
> +        return stopped_searches
> 
> I'm pretty sure this can be done more easily with:
> 
> stopped = self._search_revisions.intersection(revisions)
> self._search_revisions = self._search.revisions.difference(revisions)

I was second-guessing search_revisions.intersection, because I knew that
self._search_revisions was way smaller than revisions.

Probably a bad idea.  I'll fix.

> Similarly, you probably could do:
> +            self._search_revisions.update(r for r in revisions if
> +                                          r not in self.seen)
> 
> self._search_revisions.update(r.difference(self.seen))

Yeah, probably.  I'll try it.

> you need some extra whitespace in 'test_graph.py' between top-level
> entries. I'm not worried too much about the different variables, but
> between the import and variable defs, and then variables and "class" code.

Okay.

> It would be pretty sweet to have 'dot' graphs of the different test
> cases, since it would make understanding them a lot easier (and I know
> you like to create them :)) That is just icing, though. But having
> "python bzrlib/tests/test_graph.py" spit out the dots/png files would be
> really cool.

I'm gonna go with ASCII art for now, because it keeps the diagram close
to the code.

> Barring that, having a simple single-line description of what they are
> trying to test:
> 
> ancestry_1 => simple ancestry with merge and an extra revision on one side
> ancestry_2 => ancestry with 2 roots
> criss_cross => (probably obvious from the name)
> I guess criss_cross2 is a criss-cross with 2 roots

Okay.

> You probably should have a try/finally around "tree = self.prepare_..."
> and "tree.unlock()" to help avoid spurious warnings/errors when
> something goes wrong.
> 
> 
> This seems very oddly indented:
> +        self.assertEqual(set(['rev2b']),
> +                         graph.find_lca('rev3a', 'rev3b',
> +                                                     'rev2b'))
> +
> 
> 
> 
> If 'deprecated_graph' is really deprecated, then shouldn't the functions
> there be @deprecated(), and Knit should *not* use it?

Well, I have to get the infrastructure in before we can start to replace
everything that uses it.  I'm not that far along yet.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGdsqo0F+nu1YWqI0RAgy7AJ9XM4OPZQZqR4rZ6yok5UyJH5OSigCggd87
4z/7x3AgHLkVo1yJAreQCL4=
=7WSS
-----END PGP SIGNATURE-----