[MERGE] more common-ancestor performance improvements.
John Arbash Meinel
john at arbash-meinel.com
Mon Jun 18 19:35:01 BST 2007
Aaron Bentley wrote:
...
>>> I really prefer the explicit "found.get(r, None)". I realize that
>>> dict.get() defaults to returning None, but I don't find that
>>> immediately obvious, and I think some other people are the same way.
>>> (Especially since getattr works differently, so it isn't a consistent
>>> Python thing.)
>
> To me, the whole point of dict.get is that it returns None instead of
> raising KeyError. And getattr is broken for not following that convention.
Well, for certain APIs it is useful to pass a callable object, and you
can't exactly pass "dict[]". You *can* pass 'dict.__getitem__', but
passing 'dict.get' is much more natural. Any time you have to reach
directly for a '__foo__' function, it seems like an API violation to me.
*I* would prefer it if 'dict.get()' raised KeyError when you don't pass
an explicit default.
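To illustrate (hand waving, with a made-up lookup example, not code
from the patch):

d = {'rev2': ['rev1'], 'rev1': ['null:']}

# The bound method passes naturally anywhere a lookup callable is
# wanted, and quietly returns None for missing keys:
lookup = d.get
print lookup('rev2')    # ['rev1']
print lookup('rev3')    # None

# Getting the raising behaviour as a callable means reaching for the
# dunder method:
lookup = d.__getitem__
lookup('rev3')          # raises KeyError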
>
>>> I would like to make a small comment about another possible
>>> implementation. Rather than keeping a bunch of sets() of nodes, we could
>>> just create a specific node object for every revision. Which would have
>>> attributes for whether it had been seen by the different branches.
>
> I'm not sure what you mean by "branches" here. You mean the
> BreadthFirstSearchers? Then we'd need some concept of Searcher
> identity, I think. Using object identity could interfere with garbage
> collection.
You could simply use the revision_id of the tip revision you are
searching on. So each node would have (hand waving):
class Node(object):

    def __init__(self, revision_id):
        self.revision_id = revision_id
        self.ancestor_of = set()
And then the check would be something like:
node = searcher.next()
if len(node.ancestor_of) == len(searchers):
    # We have commonality between all searchers.
    # Can you filter if only 2 searchers are common, or does it have
    # to be common to *all* searchers?
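Still hand-waving, the marking step for one searcher might be
something like this ('nodes' as a dict of Node objects, and
'tip_revision_id' a made-up attribute, not real bzrlib API):

def mark_step(searcher, nodes, num_searchers, common):
    # Tag every node this searcher reaches with the tip it started
    # from; a node tagged by all searchers is a common ancestor.
    for revision_id in searcher.next():
        node = nodes[revision_id]
        node.ancestor_of.add(searcher.tip_revision_id)
        if len(node.ancestor_of) == num_searchers:
            common.add(revision_id)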
I'm just thinking about locality and dictionary lookups. For a given
node, the number of "ancestor_of" entries is at most the number of
searchers (i.e. 2, 3, etc.), while each searcher's "seen" set is of
size O(partial history).
And while Python dictionary lookups are supposedly O(1), I think they
are actually O(f(N)) for some slowly-growing f (log(log(N)) or
something similarly small); the performance I've seen hasn't been
truly independent of N.
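(I haven't measured this carefully; something along these lines would
show whether lookup time really stays flat as the dict grows:)

import timeit

# Time a single-key lookup at several dict sizes; truly O(1) lookups
# would give flat numbers as n grows.
for n in (10**3, 10**4, 10**5, 10**6):
    setup = 'd = dict((i, None) for i in xrange(%d))' % n
    timer = timeit.Timer('d[%d]' % (n // 2), setup)
    print n, min(timer.repeat(3, 1000000))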
>
>>> I
>>> would guess it has better performance, because it doesn't have to look
>>> things up as another set/dictionary.
>
> I'm not sure whether this would be a win or not, because the
> dictionary would be larger that way.
Well, you have 1 dictionary which is O(all revisions seen), or maybe the
nodes are just a graph rather than a dictionary. So each node is:
class Node(object):

    def __init__(self, revision_id, parent_providers):
        self.revision_id = revision_id
        self.parent_providers = parent_providers
        self._parents = None

    def _get_parents(self):
        if self._parents is None:
            self._parents = \
                self.parent_providers.get_parents(self.revision_id)
        return self._parents
And then the graph searching is done over the nodes, with no indirection
through a dictionary.
This may also be a net loss in pure Python, because dicts and lists are
so much more efficient than generic objects; but in something like C or
Pyrex you would end up working with pointers that move around the
graph, updating the specific nodes directly.
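(For example, a hand-waving walk over Node objects, assuming
_get_parents() hands back Node objects and each Node starts with a
'seen = False' flag:)

def walk(tip_node):
    # Pure pointer-chasing: no dictionary lookups inside the loop.
    tip_node.seen = True
    pending = [tip_node]
    while pending:
        node = pending.pop()
        yield node
        for parent in node._get_parents():
            if not parent.seen:
                parent.seen = True
                pending.append(parent)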
>
>>> Instead it is just a member on an
>>> object that you needed anyway. Just a thought.
>
It is mostly just me thinking about how it could be implemented.
Certainly having an algorithm with the right big-O is more important
than worrying about the constant factors at this point. (Better to have
an O(log(N)) algorithm with poor constants than a highly tuned O(N^2)
one; at N = 100,000 revisions that is roughly 17 steps versus 10^10.)
>>> _find_border_ancestors needs a docstring to document its return
>>> value. It sounds like you are only finding borders, but you are also
>>> finding 'common' and two sets based on which revisions you had to
>>> traverse (searchers.seen) on each side.
>
> The number of searchers.seen will actually depend on the number of
> inputs to find_border_ancestors.
Sure, it just needs to be mentioned, because when I first looked at it,
I didn't understand what it was doing.
...
>>> +    def stop_searching_any(self, revisions):
>>> +        """
>>> +        Remove any of the specified revisions from the search list.
>>> +
>>> +        None of the specified revisions are required to be present in the
>>> +        search list. In this case, the call is a no-op.
>>> +        """
>>> +        stopped_searches = set(l for l in self._search_revisions
>>> +                               if l in revisions)
>>> +        self._search_revisions = set(l for l in self._search_revisions
>>> +                                     if l not in revisions)
>>> +        return stopped_searches
>>>
>>> I'm pretty sure this can be done more easily with:
>>>
>>> stopped = self._search_revisions.intersection(revisions)
>>> self._search_revisions = self._search_revisions.difference(revisions)
>
> I was second-guessing search_revisions.intersection, because I knew that
> self._search_revisions was way smaller than revisions.
>
> Probably a bad idea. I'll fix.
Well, I tried doing this:

% TIMEIT -s "import random" \
         -s "big = set(random.randint(0, 50000) for x in xrange(30000))" \
         -s "little = set(random.randint(0, 50000) for x in xrange(10))" \
         "big.intersection(little)"
1.85 usec

and

         "little.intersection(big)"
1.85 usec

versus

         "x = set(l for l in little if l in big)"
6.6 usec
Not as big a difference as I would have thought, but I'm guessing that
set.intersection() already uses len() to figure out which operand is
smaller, and iterates over that one. Since set() tracks its size, that
check should be relatively cheap.
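In pure Python that strategy would look roughly like this (a guess at
what set.intersection() is doing; I haven't read the C source):

def intersection(a, b):
    # Iterate over the smaller operand and probe the larger one, so
    # the cost is O(min(len(a), len(b))) lookups either way around.
    if len(b) < len(a):
        a, b = b, a
    return set(x for x in a if x in b)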
>
>>> Similarly, you probably could do:
>>> +        self._search_revisions.update(r for r in revisions
>>> +                                      if r not in self.seen)
>>>
>>> self._search_revisions.update(revisions.difference(self.seen))
>
> Yeah, probably. I'll try it.
>
>>> You need some extra whitespace in 'test_graph.py' between top-level
>>> entries. I'm not worried too much about the different variables, but
>>> there should be blank lines between the imports and the variable
>>> defs, and between the variables and the "class" code.
>
> Okay.
>
>>> It would be pretty sweet to have 'dot' graphs of the different test
>>> cases, since it would make understanding them a lot easier (and I
>>> know you like to create them :)). That is just icing, though. But
>>> having "python bzrlib/tests/test_graph.py" spit out the dot/png
>>> files would be really cool.
>
> I'm gonna go with ASCII art for now, because it keeps the diagram close
> to the code.
Sure. As long as the graphs are small, ASCII art works pretty well.
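And dumping one of those ancestry dicts to dot is only a few lines
anyway, assuming they stay plain {revision_id: [parent_ids]} dicts
(hand waving, not tested):

def ancestry_to_dot(name, ancestry, out):
    # One edge per parent -> child pair; render with
    #   dot -Tpng ancestry_1.dot -o ancestry_1.png
    out.write('digraph %s {\n' % name)
    for revision_id, parents in sorted(ancestry.items()):
        for parent in parents:
            out.write('  "%s" -> "%s";\n' % (parent, revision_id))
    out.write('}\n')

So "python bzrlib/tests/test_graph.py" could just loop over the named
graphs and write a .dot file for each.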
>
>>> Barring that, having a simple single-line description of what they are
>>> trying to test:
>>>
>>> ancestry_1 => simple ancestry with merge and an extra revision on one side
>>> ancestry_2 => ancestry with 2 roots
>>> criss_cross => (probably obvious from the name)
>>> I guess criss_cross2 is a criss-cross with 2 roots
>
> Okay.
>
>>> You probably should have a try/finally around "tree = self.prepare_..."
>>> and "tree.unlock()" to help avoid spurious warnings/errors when
>>> something goes wrong.
>>>
>>>
>>> This seems very oddly indented:
>>> +        self.assertEqual(set(['rev2b']),
>>> +            graph.find_lca('rev3a', 'rev3b',
>>> +                'rev2b'))
>>>
>>> If 'deprecated_graph' is really deprecated, then shouldn't the functions
>>> there be @deprecated(), and Knit should *not* use it?
>
> Well, I have to get the infrastructure in before we can start to replace
> everything that uses it. I'm not that far along yet.
>
> Aaron
Sure. Overall I think I like what you have.
John
=:->