[MERGE] more common-ancestor performance improvements.
John Arbash Meinel
john at arbash-meinel.com
Mon Jun 18 17:58:12 BST 2007
John Arbash Meinel has voted +1 (conditional).
Status is now: Conditionally approved
Comment:
I think I was expecting Robert to re-review, since he seemed the most
capable for reviewing the "graph" code. I'll do what I can, though. I
don't feel like I have a great grasp of the algorithm you are using (for
example, how can you have something that is a border ancestor but is a
descendant of another border ancestor. I would at least think it would
make it a non-border ancestor....)
It might be good to have the algorithm clearly documented, with 'dot'
examples of the different edge cases we are handling and why. Not that I
would make that a strict requirement for merging, but it would make it a
bit easier to understand why the code is doing what it does.
+ def get_parents(self, revision_ids):
+ """
+ Find revision ids of the parents of a list of revisions
+
+ A list is returned of the same length as the input. Each entry
I'm pretty sure we put the opening comment on the same line as """. So
it should look like:
+ def get_parents(self, revision_ids):
+ """Find revision ids of the parents of a list of revisions
+
+ A list is returned of the same length as the input. Each entry
I don't know that we've settled on whether the opening line should end
in a '.' or not.
I may be misunderstanding this loop, but it looks broken to me.
+ found = {}
+ for parents_provider in self._parent_providers:
+ parent_list = parents_provider.get_parents(
+ [r for r in revision_ids if r not in found])
+ new_found = dict((k, v) for k, v in zip(revision_ids,
parent_list)
+ if v is not None)
+ found.update(new_found)
+ if len(found) == len(revision_ids):
+ break
+ return [found.get(r) for r in revision_ids]
Specifically, it looks like you are trying to ask for parents of
revisions from multiple sources. However, there are a few things which
look wrong:
You get the "parent_list" from a filtered version of revision_ids (by
checking if they have already been found). However, you insert them into
a new dictionary by exact matching with revision_ids. (zip(revision_ids,
parent_list)).
Also, as near as I can tell, this loop has no way of telling you what
parents a "provider" has access to, so it has to return something for
every revision that you request of it.
Ultimately, this loop looks like if you ever had to make more than 1
pass, it would break in very bad ways. You do allow 'get_parents()' to
return None, so probably you just need to keep the filtered list.
+ return [found.get(r) for r in revision_ids]
I really prefer the explicit "found.get(r, None)" I realize that
dict.get() defaults to returning None, but I don't find that directly
obvious, and I think some other people are the same way. (Especially
since getattr works differently, so it isn't a consistent python thing)
I would like to make a small comment about another possible
implementation. Rather than keeping a bunch of sets() of nodes, we could
just create a specific node object for every revision. Which would have
attributes for whether it had been seen by the different branches. I
would guess it has better performance, because it doesn't have to look
things up as another set/dictionary. Instead it is just a member on an
object that you needed anyway. Just a thought.
_find_border_ancestors needs a doc string to document its return value.
It sounds like you are only finding borders, but you are also finding
'common' and 2 sets based on what revisions you have to traverse
(searchers.seen) on each side.
+ def __repr__(self):
+ return '_BreadthFirstSearcher(self._search_revisions=%r,' \
+ ' self.seen=%r)' % (self._search_revisions, self.seen)
I generally prefer the "parenthesis" form to the '\' form. So you could
write this as:
return ('_BreadthFirstSearcher(self._search_revisions=%r,'
' self.seen=%r)') % (self._search_revisions, self.seen)
I can understand the extra parentheses may be a bit confusing.
+ def find_seen_ancestors(self, revision):
+ """Find ancstors of this revision that have already been
seen."""
^- typo
+ def stop_searching_any(self, revisions):
+ """
+ Remove any of the specified revisions from the search list.
+
+ None of the specified revisions are required to be present in
the
+ search list. In this case, the call is a no-op.
+ """
+ stopped_searches = set(l for l in self._search_revisions
+ if l in revisions)
+ self._search_revisions = set(l for l in self._search_revisions
+ if l not in revisions)
+ return stopped_searches
I'm pretty sure this can be done more easily with:
stopped = self._search_revisions.intersection(revisions)
self._search_revisions = self._search.revisions.difference(revisions)
assuming 'revisions' is a set. But if it isn't, you really shouldn't be
doing 'if l not in revisions', since that is a linear search.
Similarly, you probably could do:
+ self._search_revisions.update(r for r in revisions if
+ r not in self.seen)
self._search_revisions.update(r.difference(self.seen))
you need some extra whitespace in 'test_graph.py' between top-level
entries. I'm not worried too much about the different variables, but
between the import and variable defs, and then variables and "class"
code.
It would be pretty sweet to have 'dot' graphs of the different test
cases, since it would make understanding them a lot easier (and I know
you like to create them :)) That is just icing, though. But having
"python bzrlib/tests/test_graph.py" spit out the dots/png files would be
really cool.
Barring that, having a simple single-line description of what they are
trying to test:
ancestry_1 => simple ancestry with merge and an extra revision on one
side
ancestry_2 => ancestry with 2 roots
criss_cross => (probably obvious from the name)
I guess criss_cross2 is a criss-cross with 2 roots
You probably should have a try/finally around "tree = self.prepare_..."
and "tree.unlock()" to help avoid spurious warnings/errors when
something goes wrong.
This seems very oddly indented:
+ self.assertEqual(set(['rev2b']),
+ graph.find_lca('rev3a', 'rev3b',
+ 'rev2b'))
+
If 'deprecated_graph' is really deprecated, then shouldn't the functions
there be @deprecated(), and Knit should *not* use it?
I suppose that can be done in a later pass.
For details, see:
http://bundlebuggy.aaronbentley.com/request/%3C466DEF52.5070205%40utoronto.ca%3E
More information about the bazaar
mailing list