[MERGE] more common-ancestor performance improvements.

John Arbash Meinel john at arbash-meinel.com
Mon Jun 18 19:35:01 BST 2007


Aaron Bentley wrote:

...

>>> I really prefer the explicit "found.get(r, None)". I realize that
>>> dict.get() defaults to returning None, but I don't find that directly
>>> obvious, and I think some other people are the same way. (Especially
>>> since getattr works differently, so it isn't a consistent Python
>>> thing.)
> 
> To me, the whole point of dict.get is that it returns None instead of
> raising KeyError.  And getattr is broken for not following that convention.

Well, for certain APIs it is useful to pass a callable object, and you
can't exactly pass "dict[]". You *can* pass 'dict.__getitem__', but
passing 'dict.get' is much more natural. Any time you have to go
directly to a '__foo__' function, it seems like an API violation to me.

*I* would prefer it if 'dict.get()' raised KeyError when you didn't
pass an explicit default.
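
For example (a made-up 'lookup_all', nothing from the patch; it just
shows why the bound method matters):

  parents = {'rev2': ('rev1',), 'rev1': ()}

  def lookup_all(revision_ids, getter):
      # 'getter' is any callable taking a single key
      return [getter(r) for r in revision_ids]

  lookup_all(['rev1', 'rev3'], parents.get)          # [(), None]
  lookup_all(['rev1'], parents.__getitem__)          # works, but ugly
  lookup_all(['rev1', 'rev3'], parents.__getitem__)  # raises KeyError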

> 
>>> I would like to make a small comment about another possible
>>> implementation. Rather than keeping a bunch of sets() of nodes, we
>>> could just create a specific node object for every revision, which
>>> would have attributes recording whether it had been seen by the
>>> different branches.
> 
> I'm not sure what you mean by "branches" here.  You mean the
> BreadthFirstSearchers?  Then we'd need some concept of Searcher
> identity, I think.  Using object identity could interfere with garbage
> collection.

You could simply use the revision_id of the tip revision you are
searching on. So each node would have (hand waving):

class Node(object):

    def __init__(self, revision_id):
        self.revision_id = revision_id
        # revision_ids of the search tips that have seen this node
        self.ancestor_of = set()

And then the check would be something like:

  node = searcher.next()
  if len(node.ancestor_of) == len(searchers):
      # We have commonality between all searchers
      # Can you filter if only 2 searchers are common, or does it have
      # to be common to *all* searchers?
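
Filling in the hand waving a bit, each searcher could tag nodes with
its tip revision_id as it walks. A sketch; 'Searcher', 'parent_ids',
and the revision_id -> Node mapping are all invented here:

  class Searcher(object):

      def __init__(self, tip_revision_id, graph):
          # graph: dict mapping revision_id -> Node
          self.tip = tip_revision_id
          self._graph = graph
          self._frontier = [tip_revision_id]

      def next(self):
          # Pop revisions until we hit one we haven't marked yet, mark
          # it as an ancestor of our tip, and queue its parents.
          # (Raises IndexError once the frontier is exhausted.)
          node = self._graph[self._frontier.pop()]
          while self.tip in node.ancestor_of:
              node = self._graph[self._frontier.pop()]
          node.ancestor_of.add(self.tip)
          self._frontier.extend(node.parent_ids)
          return node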

I'm just thinking about locality and dictionary lookups. For a given
node, the number of 'ancestor_of' entries is at most the number of
searchers (i.e. 2, 3, etc.).

'searcher.seen', on the other hand, is of size O(partial history). And
while Python dictionary lookups are supposedly O(1), I think they are
really O(f(N)) for some slowly growing f, maybe log(log(N)) or
something small; I've seen lookup performance that wasn't truly
independent of N.

> 
>>> I
>>> would guess it has better performance, because it doesn't have to look
>>> things up as another set/dictionary.
> 
> I'm not sure whether this would be a win or not.  Because the dictionary
> would be larger that way.

Well, you have one dictionary which is O(all revisions seen), or maybe
the nodes just form a graph rather than living in a dictionary. So each
node is:

class Node(object):

    def __init__(self, revision_id, parent_provider):
        self.revision_id = revision_id
        self._parent_provider = parent_provider
        self._parents = None  # looked up lazily

    def _get_parents(self):
        # Ask the parent provider on first access, then cache.
        if self._parents is None:
            self._parents = self._parent_provider.get_parents(
                self.revision_id)
        return self._parents

And then the graph searching is done over the nodes, with no indirection
through a dictionary.
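
i.e. something like this (a sketch, assuming _get_parents() hands back
Node objects rather than revision_ids):

  def iter_ancestry(node):
      # Walk Node objects directly; the per-step work is attribute
      # access on the node, not a revision_id -> info dict lookup.
      seen = set()
      stack = [node]
      while stack:
          current = stack.pop()
          if current.revision_id in seen:
              continue
          seen.add(current.revision_id)
          stack.extend(current._get_parents())
          yield current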

This may also be a net loss in Python, because dicts and lists are so
much more efficient than generic objects. But in something like C/Pyrex
you would end up working with pointers that move around the graph,
updating the specific nodes.

> 
>>> Instead it is just a member on an
>>> object that you needed anyway. Just a thought.
> 

It is mostly just me thinking about how it could be implemented.
Certainly having an algorithm with the right big O is more important
than worrying about constant factors at this point. (Better to have an
O(log(N)) algorithm with poor constants than an O(N^2) one with optimal
constants.)

>>> _find_border_ancestors needs a docstring to document its return
>>> value. It sounds like you are only finding borders, but you are also
>>> finding 'common' and two sets based on which revisions you had to
>>> traverse (searchers.seen) on each side.
> 
> The number of searchers.seen will actually depend on the number of
> inputs to find_border_ancestors.

Sure, it just needs to be mentioned, because when I first looked at it,
I didn't understand what it was doing.


...


>>> +    def stop_searching_any(self, revisions):
>>> +        """
>>> +        Remove any of the specified revisions from the search list.
>>> +
>>> +        None of the specified revisions are required to be present in the
>>> +        search list.  In this case, the call is a no-op.
>>> +        """
>>> +        stopped_searches = set(l for l in self._search_revisions
>>> +                               if l in revisions)
>>> +        self._search_revisions = set(l for l in self._search_revisions
>>> +                                     if l not in revisions)
>>> +        return stopped_searches
>>>
>>> I'm pretty sure this can be done more easily with:
>>>
>>> stopped = self._search_revisions.intersection(revisions)
>>> self._search_revisions = self._search_revisions.difference(revisions)
> 
> I was second-guessing search_revisions.intersection, because I knew that
> self._search_revisions was way smaller than revisions.
> 
> Probably a bad idea.  I'll fix.

Well, I tried doing this:
% TIMEIT -s "import random" \
         -s "big = set(random.randint(0, 50000) for x in xrange(30000))" \
         -s "little = set(random.randint(0, 50000) for x in xrange(10))" \
  "big.intersection(little)"
1.85 usec

and
  "little.intersection(big)"
1.85 usec

versus
  x = set(l for l in little if l in big)
6.6 usec

Not as big of a difference as I would have thought, but I'm guessing
that set.intersection() already uses len() to figure out which set is
smaller and iterates over that one. Since set() tracks its size, that
check is relatively cheap.
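
So the whole method could shrink to something like this (a sketch
against the attributes in the posted patch):

  def stop_searching_any(self, revisions):
      """Remove any of the specified revisions from the search list.

      Revisions not present in the search list are ignored.

      :return: The set of revisions actually stopped.
      """
      revisions = frozenset(revisions)
      stopped = self._search_revisions.intersection(revisions)
      self._search_revisions = self._search_revisions.difference(
          revisions)
      return stopped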

> 
>>> Similarly, you probably could do:
>>> +            self._search_revisions.update(r for r in revisions if
>>> +                                          r not in self.seen)
>>>
>>> self._search_revisions.update(revisions.difference(self.seen))
> 
> Yeah, probably.  I'll try it.
> 
>>> you need some extra whitespace in 'test_graph.py' between top-level
>>> entries. I'm not worried too much about the different variables, but
>>> between the import and the variable defs, and then between the
>>> variables and the "class" code.
> 
> Okay.
> 
>>> It would be pretty sweet to have 'dot' graphs of the different test
>>> cases, since it would make understanding them a lot easier (and I
>>> know you like to create them :)). That is just icing, though. But
>>> having "python bzrlib/tests/test_graph.py" spit out the dot/png
>>> files would be really cool.
> 
> I'm gonna go with ASCII art for now, because it keeps the diagram close
> to the code.

Sure. As long as the graphs are small, ASCII art works pretty well.
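
e.g. ancestry_1 could carry something like this right next to the dict
(illustrative shape only, not necessarily the exact test graph):

     rev1
     /  \
  rev2a  rev2b
    |      |
  rev3a    |
     \    /
     rev4a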

> 
>>> Barring that, having a simple single-line description of what they are
>>> trying to test:
>>>
>>> ancestry_1 => simple ancestry with merge and an extra revision on one side
>>> ancestry_2 => ancestry with 2 roots
>>> criss_cross => (probably obvious from the name)
>>> I guess criss_cross2 is a criss-cross with 2 roots
> 
> Okay.
> 
>>> You probably should have a try/finally around "tree = self.prepare_..."
>>> and "tree.unlock()" to help avoid spurious warnings/errors when
>>> something goes wrong.
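
i.e. roughly (hand waving; 'prepare_ancestry_tree' and 'get_graph' are
made-up names for whatever the helpers end up being called):

  tree = self.prepare_ancestry_tree()  # hypothetical helper; returns
                                       # a locked tree
  try:
      graph = self.get_graph(tree)  # however the test builds its Graph
      self.assertEqual(set(['rev2b']),
                       graph.find_lca('rev3a', 'rev3b', 'rev2b'))
  finally:
      tree.unlock()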
>>>
>>>
>>> This seems very oddly indented:
>>> +        self.assertEqual(set(['rev2b']),
>>> +                         graph.find_lca('rev3a', 'rev3b',
>>> +                                                     'rev2b'))
>>> +
>>>
>>>
>>>
>>> If 'deprecated_graph' is really deprecated, then shouldn't the functions
>>> there be @deprecated(), and Knit should *not* use it?
> 
> Well, I have to get the infrastructure in before we can start to replace
> everything that uses it.  I'm not that far along yet.
> 
> Aaron

Sure. Overall I think I like what you have.

John
=:->