[MERGE][1.0] Graph.heads() optimization... Now 1770x
John Arbash Meinel
john at arbash-meinel.com
Tue Dec 18 15:18:33 GMT 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Aaron Bentley wrote:
> John Arbash Meinel wrote:
>
> Sorry to take so long getting to this review. This graph stuff is
> always challenging for me.
>
Thanks for looking it over. I'm actually doing a "complete rewrite" in a new
branch. Just to address all the different things. And I've been waiting to get
the find_differences() stuff correct before I submitted again.
But I'll break it down into some smaller changes, and try to get them reviewed
again.
>> So I went ahead and change the functions so that step() actually takes a
>> dictionary of parents that it is going to care about. And made
>> "preload_parents" just a member of Graph. Which allows me to make it private,
>> and doesn't involve running into a bzr-svn implementation that isn't providing it.
>
> It would be great if you could rename that to something clearer, like
> get_parents_map
Will be done.
...
> + def get_parents(self, revision_ids):
> + """Find revision ids of the parents of a list of revisions
> +
> + A list is returned of the same length as the input. Each entry
> + is a list of parent ids for the corresponding input revision.
> +
> + [NULL_REVISION] is used as the parent of the first user-committed
> + revision. Its parent list is empty.
> +
> + If the revision is not present (i.e. a ghost), None is used in
> place
> + of the list of parents.
> + """
> + needed = set()
>
> ^^^ I think this docstring should just reference the
> StackedParentsProvider one, like the others do.
Sure.
...
> if revision_id not in cache:
> needed.add(revision_id)
>
> + if needed:
> + new_parents = dict(zip(needed,
> +
> self._real_provider.get_parents(needed)))
>
> ^^^ I prefer more explicit tests, like "if len(needed) > 0". Still, I
> wouldn't demand such a change.
The only problem with "if len(needed) > 0" is that it incurs a bunch of
function overhead. You have a len() call, which is fairly fast, but still has
to reference the a PyObject for the number, and then a comparison test, versus
a simple "if" check. That said, I don't think it would rate high on any
profiling tests. (Though I have seen 1M calls to len() start to add up.)
And regardless, I *prefer* the style of "if needed". I know you have concerns
about having '' and None, and [] all evaluating to False. In this particular
case, I don't think it is an issue since the scope of needed is very small. I'm
happy to conform to whatever style guidelines we decide for Bazaar code, though.
>
> class Graph(object):
> """Provide incremental access to revision graphs.
>
> @@ -183,40 +228,36 @@
> active_searchers = searchers[:]
> border_ancestors = set()
> def update_common(searcher, revisions):
> - w_seen_ancestors = searcher.find_seen_ancestors(
> - revision)
> + w_seen_ancestors = searcher.find_seen_ancestors(revisions)
>
> ^^^ Is this a longstanding bug?
>
Yes. It only worked because all inner loops were only passing a single value
in, and all of them were using the variable "revision". So the code is actually
correct, but for the wrong reasons.
But in my next update, I get rid of this function anyway.
...
> ^^^ "just stop it" comment no longer makes sense.
>
> + def will_search(self):
> + """Give the list of nodes that will be searched next."""
> + if self._search_revisions is None:
> + return self._start
>
> ^^^ I don't think this is really accurate. The next iteration will
> yield the _start revisions. It won't traverse from them into their
> parents. It's mostly harmless, but it's probably causing an extra
> Graph._preload_parents call.
>
Yeah, you don't actually need to preload the nodes that are in _start. I
addressed it in a different way since, but I'll keep it in mind during the cleanup.
> - def find_seen_ancestors(self, revision):
> - """Find ancestors of this revision that have already been
> seen."""
> - searcher = _BreadthFirstSearcher([revision],
> self._parents_provider)
> + def find_seen_ancestors(self, revisions):
> + """Find ancestors of these revisions that have already been
> seen."""
> + revisions = self.seen.intersection(revisions)
> seen_ancestors = set()
> - for ancestors in searcher:
> - for ancestor in ancestors:
> - if ancestor not in self.seen:
> - searcher.stop_searching_any([ancestor])
> - else:
> - seen_ancestors.add(ancestor)
> + while revisions:
> + seen_ancestors.update(revisions)
> + next_revisions = set()
> + for parents in self._parents_provider.get_parents(revisions):
> + if parents is None:
> + continue
> + next_revisions.update(self.seen.intersection(parents))
> + revisions = next_revisions
> return seen_ancestors
>
> ^^^ I don't understand the reasons for this change. You're doing a
> breadth-first traversal; why not use a breadth-first searcher?
>
Off-hand it was just part of understanding the code. I was seeing a lot of time
spent in 'find_seen_ancestors', but I know think it was because we were
effectively doing:
for node in new_common:
for searcher in searchers:
seen = searcher.find_seen_ancestors(node)
...
Which is doing N searches on each searcher. I'll take a closer look and see if
I can go back to using a BFF.
Also, part of my original rewrite moved more logic into BFF making them
heavier-weight, but that turned out to be a bad mistake anyway.
...
>
> The key things I'd like to see are:
> 1. update preload_parents to get_parents_map API
> 2. comment updates.
>
> I'd also appreciate an explanation of why BreadthFirstSearcher can't be
> used efficiently for find_seen_ancestors, but I trust you have measured
> an efficiency gain there.
>
> Aaron
I'm going to go ahead and do a get_parents_map patch first, and then work some
more patches on top of that.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFHZ+TIJdeBCYSNAAMRAgSNAJ9cR6zNJ/JchECZChbLICMij/0iggCgiw25
/s0wYrlNDqna8RaKMyI9rHI=
=5Gkl
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list