[MERGE] more common-ancestor performance improvements.

Mon Jun 18 17:58:12 BST 2007

John Arbash Meinel has voted +1 (conditional).
Status is now: Conditionally approved
Comment:
I think I was expecting Robert to re-review, since he seemed the most 
capable for reviewing the "graph" code. I'll do what I can, though. I 
don't feel like I have a great grasp of the algorithm you are using (for 
example, how can you have something that is a border ancestor but is a 
descendant of another border ancestor. I would at least think it would 
make it a non-border ancestor....)

It might be good to have the algorithm clearly documented, with 'dot' 
examples of the different edge cases we are handling and why. Not that I 
would make that a strict requirement for merging, but it would make it a 
bit easier to understand why the code is doing what it does.

+    def get_parents(self, revision_ids):
+        """
+        Find revision ids of the parents of a list of revisions
+
+        A list is returned of the same length as the input.  Each entry

I'm pretty sure we put the opening comment on the same line as """. So 
it should look like:
+    def get_parents(self, revision_ids):
+        """Find revision ids of the parents of a list of revisions
+
+        A list is returned of the same length as the input.  Each entry

I don't know that we've settled on whether the opening line should end 
in a '.' or not.

I may be misunderstanding this loop, but it looks broken to me.
+        found = {}
+        for parents_provider in self._parent_providers:
+            parent_list = parents_provider.get_parents(
+                [r for r in revision_ids if r not in found])
+            new_found = dict((k, v) for k, v in zip(revision_ids, 
parent_list)
+                             if v is not None)
+            found.update(new_found)
+            if len(found) == len(revision_ids):
+                break
+        return [found.get(r) for r in revision_ids]

Specifically, it looks like you are trying to ask for parents of 
revisions from multiple sources. However, there are a few things which 
look wrong:

You get the "parent_list" from a filtered version of revision_ids (by 
checking if they have already been found). However, you insert them into 
a new dictionary by exact matching with revision_ids. (zip(revision_ids, 
parent_list)).

Also, as near as I can tell, this loop has no way of telling you what 
parents a "provider" has access to, so it has to return something for 
every revision that you request of it.

Ultimately, this loop looks like if you ever had to make more than 1 
pass, it would break in very bad ways. You do allow 'get_parents()' to 
return None, so probably you just need to keep the filtered list.

+        return [found.get(r) for r in revision_ids]

I really prefer the explicit "found.get(r, None)" I realize that 
dict.get() defaults to returning None, but I don't find that directly 
obvious, and I think some other people are the same way. (Especially 
since getattr works differently, so it isn't a consistent python thing)

I would like to make a small comment about another possible 
implementation. Rather than keeping a bunch of sets() of nodes, we could 
just create a specific node object for every revision. Which would have 
attributes for whether it had been seen by the different branches. I 
would guess it has better performance, because it doesn't have to look 
things up as another set/dictionary. Instead it is just a member on an 
object that you needed anyway. Just a thought.

_find_border_ancestors needs a doc string to document its return value. 
It sounds like you are only finding borders, but you are also finding 
'common' and 2 sets based on what revisions you have to traverse 
(searchers.seen) on each side.

+    def __repr__(self):
+        return '_BreadthFirstSearcher(self._search_revisions=%r,' \
+            ' self.seen=%r)' % (self._search_revisions, self.seen)

I generally prefer the "parenthesis" form to the '\' form. So you could 
write this as:

return ('_BreadthFirstSearcher(self._search_revisions=%r,'
         ' self.seen=%r)') % (self._search_revisions, self.seen)

I can understand the extra parentheses may be a bit confusing.

+    def find_seen_ancestors(self, revision):
+        """Find ancstors of this revision that have already been 
seen."""

^- typo

+    def stop_searching_any(self, revisions):
+        """
+        Remove any of the specified revisions from the search list.
+
+        None of the specified revisions are required to be present in 
the
+        search list.  In this case, the call is a no-op.
+        """
+        stopped_searches = set(l for l in self._search_revisions
+                               if l in revisions)
+        self._search_revisions = set(l for l in self._search_revisions
+                                     if l not in revisions)
+        return stopped_searches

I'm pretty sure this can be done more easily with:

stopped = self._search_revisions.intersection(revisions)
self._search_revisions = self._search.revisions.difference(revisions)

assuming 'revisions' is a set. But if it isn't, you really shouldn't be 
doing 'if l not in revisions', since that is a linear search.

Similarly, you probably could do:
+            self._search_revisions.update(r for r in revisions if
+                                          r not in self.seen)

self._search_revisions.update(r.difference(self.seen))

you need some extra whitespace in 'test_graph.py' between top-level 
entries. I'm not worried too much about the different variables, but 
between the import and variable defs, and then variables and "class" 
code.

It would be pretty sweet to have 'dot' graphs of the different test 
cases, since it would make understanding them a lot easier (and I know 
you like to create them :)) That is just icing, though. But having 
"python bzrlib/tests/test_graph.py" spit out the dots/png files would be 
really cool.

Barring that, having a simple single-line description of what they are 
trying to test:

ancestry_1 => simple ancestry with merge and an extra revision on one 
side
ancestry_2 => ancestry with 2 roots
criss_cross => (probably obvious from the name)
I guess criss_cross2 is a criss-cross with 2 roots

You probably should have a try/finally around "tree = self.prepare_..." 
and "tree.unlock()" to help avoid spurious warnings/errors when 
something goes wrong.

This seems very oddly indented:
+        self.assertEqual(set(['rev2b']),
+                         graph.find_lca('rev3a', 'rev3b',
+                                                     'rev2b'))
+

If 'deprecated_graph' is really deprecated, then shouldn't the functions 
there be @deprecated(), and Knit should *not* use it?

I suppose that can be done in a later pass.

For details, see: 
http://bundlebuggy.aaronbentley.com/request/%3C466DEF52.5070205%40utoronto.ca%3E