[MERGE][1.0] Graph.heads() optimization... Now 1770x

John Arbash Meinel john at arbash-meinel.com
Tue Dec 18 15:18:33 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
> 
> Sorry to take so long getting to this review.  This graph stuff is
> always challenging for me.
> 

Thanks for looking it over. I'm actually doing a "complete rewrite" in a new
branch. Just to address all the different things. And I've been waiting to get
the find_differences() stuff correct before I submitted again.

But I'll break it down into some smaller changes, and try to get them reviewed
again.


>> So I went ahead and change the functions so that step() actually takes a
>> dictionary of parents that it is going to care about. And made
>> "preload_parents" just a member of Graph. Which allows me to make it private,
>> and doesn't involve running into a bzr-svn implementation that isn't providing it.
> 
> It would be great if you could rename that to something clearer, like
> get_parents_map

Will be done.


...

> +    def get_parents(self, revision_ids):
> +        """Find revision ids of the parents of a list of revisions
> +
> +        A list is returned of the same length as the input.  Each entry
> +        is a list of parent ids for the corresponding input revision.
> +
> +        [NULL_REVISION] is used as the parent of the first user-committed
> +        revision.  Its parent list is empty.
> +
> +        If the revision is not present (i.e. a ghost), None is used in
> place
> +        of the list of parents.
> +        """
> +        needed = set()
> 
> ^^^ I think this docstring should just reference the
> StackedParentsProvider one, like the others do.

Sure.


...

> if revision_id not in cache:
>     needed.add(revision_id)
> 
> +        if needed:
> +            new_parents = dict(zip(needed,
> +
> self._real_provider.get_parents(needed)))
> 
> ^^^ I prefer more explicit tests, like "if len(needed) > 0".  Still, I
> wouldn't demand such a change.

The only problem with "if len(needed) > 0" is that it incurs a bunch of
function overhead. You have a len() call, which is fairly fast, but still has
to reference the a PyObject for the number, and then a comparison test, versus
a simple "if" check. That said, I don't think it would rate high on any
profiling tests. (Though I have seen 1M calls to len() start to add up.)

And regardless, I *prefer* the style of "if needed". I know you have concerns
about having '' and None, and [] all evaluating to False. In this particular
case, I don't think it is an issue since the scope of needed is very small. I'm
happy to conform to whatever style guidelines we decide for Bazaar code, though.


> 
>  class Graph(object):
>      """Provide incremental access to revision graphs.
> 
> @@ -183,40 +228,36 @@
>          active_searchers = searchers[:]
>          border_ancestors = set()
>          def update_common(searcher, revisions):
> -            w_seen_ancestors = searcher.find_seen_ancestors(
> -                revision)
> +            w_seen_ancestors = searcher.find_seen_ancestors(revisions)
> 
> ^^^ Is this a longstanding bug?
> 

Yes. It only worked because all inner loops were only passing a single value
in, and all of them were using the variable "revision". So the code is actually
correct, but for the wrong reasons.

But in my next update, I get rid of this function anyway.


...

> ^^^ "just stop it" comment no longer makes sense.
> 
> +    def will_search(self):
> +        """Give the list of nodes that will be searched next."""
> +        if self._search_revisions is None:
> +            return self._start
> 
> ^^^ I don't think this is really accurate.  The next iteration will
> yield  the _start revisions.  It won't traverse from them into their
> parents.  It's mostly harmless, but it's probably causing an extra
> Graph._preload_parents call.
> 

Yeah, you don't actually need to preload the nodes that are in _start. I
addressed it in a different way since, but I'll keep it in mind during the cleanup.


> -    def find_seen_ancestors(self, revision):
> -        """Find ancestors of this revision that have already been
> seen."""
> -        searcher = _BreadthFirstSearcher([revision],
> self._parents_provider)
> +    def find_seen_ancestors(self, revisions):
> +        """Find ancestors of these revisions that have already been
> seen."""
> +        revisions = self.seen.intersection(revisions)
>          seen_ancestors = set()
> -        for ancestors in searcher:
> -            for ancestor in ancestors:
> -                if ancestor not in self.seen:
> -                    searcher.stop_searching_any([ancestor])
> -                else:
> -                    seen_ancestors.add(ancestor)
> +        while revisions:
> +            seen_ancestors.update(revisions)
> +            next_revisions = set()
> +            for parents in self._parents_provider.get_parents(revisions):
> +                if parents is None:
> +                    continue
> +                next_revisions.update(self.seen.intersection(parents))
> +            revisions = next_revisions
>          return seen_ancestors
> 
> ^^^ I don't understand the reasons for this change.  You're doing a
> breadth-first traversal; why not use a breadth-first searcher?
> 

Off-hand it was just part of understanding the code. I was seeing a lot of time
spent in 'find_seen_ancestors', but I know think it was because we were
effectively doing:

for node in new_common:
  for searcher in searchers:
    seen = searcher.find_seen_ancestors(node)
...

Which is doing N searches on each searcher. I'll take a closer look and see if
I can go back to using a BFF.

Also, part of my original rewrite moved more logic into BFF making them
heavier-weight, but that turned out to be a bad mistake anyway.

...

> 
> The key things I'd like to see are:
> 1. update preload_parents to get_parents_map API
> 2. comment updates.
> 
> I'd also appreciate an explanation of why BreadthFirstSearcher can't be
> used efficiently for find_seen_ancestors, but I trust you have measured
> an efficiency gain there.
> 
> Aaron

I'm going to go ahead and do a get_parents_map patch first, and then work some
more patches on top of that.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHZ+TIJdeBCYSNAAMRAgSNAJ9cR6zNJ/JchECZChbLICMij/0iggCgiw25
/s0wYrlNDqna8RaKMyI9rHI=
=5Gkl
-----END PGP SIGNATURE-----



More information about the bazaar mailing list