[RFC] Non-memoisable BreadthFirstSearcher

Fri Apr 25 07:18:06 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
| On Thu, 2008-04-24 at 21:39 -0500, John Arbash Meinel wrote:
|> -----BEGIN PGP SIGNED MESSAGE-----
|> Hash: SHA1
|>
|> I've been working on the find_differences/find_unique_ancestors code,
|> and I'm finding that quite a few of the changes to _BreadthFirstSearcher
|> made it slower.
|>
|> For example, in "start_searching()" there is a call to _do_query(),
|> because it wants to know if there are any ghosts.
|>
|> However, if I remove that _do_query() call, the average search time
|> drops from 0.097s => 0.063s.
|> Or almost 30% of the time is spent in that _do_query().
|>
|> Some of that might be how I've structured my code. Specifically, I call
|> start_searching() with the assumption that any nodes that have have
|> already been searched will be trivially discarded. (I did that, because
|> otherwise I have to add another parent graph search to find what nodes
|> are trivial to discard.)
|>
|> Anyway, I can't say that I fully understand Robert's changes, to make
|> sure that I don't break anything that he was trying to do.
|>
|> So my feeling was to write a new searcher, which would just be the bare
|> minimum. I would also even consider having some aggregate search
|> objects, because there is a bit of overhead in walking the same graph
|> nodes multiple times.
|>
|> Part of the problem is that I'll have up to 30 or so searchers going.
|> And that is because the only way to know that you've checked enough
|> common revisions is when all common search revisions are ancestors of
|> *all* non-common revisions. So you have to search a large set of the
|> unique revisions (the ones that aren't trivially ancestors of others).
|> [This is better for find_unique_ancestors, but it really hurts
|> find_differences() on the bzr.dev code base].
|>
|> Then again, if everything is fully tested that Robert did, after my
|> changes all the tests still pass.
|>
|> Thoughts?
|> John
|
| I don't think we need two searchers; the searches should perform
| identical query counts if used as they were before my changes. There are
| two lookup strategies - pre result and post-result. It sounds like
| you're using post-result.
|
| Also, I don't understand why you'd ever need more than 3 searchers.
|
| -Rob

Well, your specific changes infiltrated the pre-result section. Specifically you
had:

~    def start_searching(self, revisions):
~        """Add revisions to the search.

~        The parents of revisions will be returned from the next call to next()
~        or next_with_ghosts(). If next_with_ghosts was the most recently used
~        next* call then the return value is the result of looking up the
~        ghost/not ghost status of revisions. (A tuple (present, ghosted)).
~        """
~        revisions = frozenset(revisions)
~        self._started_keys.update(revisions)
~        new_revisions = revisions.difference(self.seen)
~        revs, ghosts, query, parents = self._do_query(revisions)
~        self._stopped_keys.update(ghosts)

^--- _do_query is happening before the if check
~        if self._returning == 'next':
~            self._next_query.update(new_revisions)
~        else:
~            # perform a query on revisions
~            self._current_present.update(revs)
~            self._current_ghosts.update(ghosts)
~            self._next_query.update(query)
~            self._current_parents.update(parents)
~            return revs, ghosts

If I change it to:

if self._returning == 'next':
~  self._next_query.update(new_revisions)
~  self.seen.update(new_revisions)
else:
~  revs, ghosts, ... = self._do_query(revisions)
~  self._stopped_keys.update(ghosts)
~  ...

Then I see a very large performance improvement (90ms => 60ms as mentioned before).

The reason I need more than 3 searchers is because for each unique node, you
only *know* that it is unique when all common search tips are ancestors of the
node. Which holds true for *every* unique node. Otherwise history shortcuts will
make it look like some nodes are unique when they are not.

The example I use in the code is something like:

~  A
~  |
~  B
~ / \
D   E
|\ /|
| F |
| | |
| G |
| | |
| H |
| | |
| I |
| | |
| J |
|/ \|
K   L

If you start a searcher on K and L, you will see:

step 1: K-1  L-2
step 2: J-1&2 D-1 E-2
step 3: I-1&2 B-1&2 # All search tips are "common"

If you stop there, you would consider 'D' and 'E' to be unique to each side,
which is not true. They are actually common nodes, but you have to step through
the rest of F-J to find that out.

So the next step, I find the "trivial" ancestors for the unique nodes. In this
case 'D' and 'E' are direct ancestors of other unique nodes, so are more
interesting.

I then start new searchers at 'D' and 'E', and continue stepping. Because 'B' is
a known ancestor of *both* D and E, I can stop searching there, and I will
continue walking (D,E => B, A, etc) and J until it would be a descendant of both
D and E.

If you want an example with more than 2 searchers, how about:

~    A
~   /|\
~  / B \
~ / / \ \
| D   E |
| |\ /| |
M | F | N
~ \| | |/
~  O G P
~  |\| |
~  | H |
~  | | |
~  | : |
~  | | |
~  | J |
~  |/ \|
~  K   L

Assume that ':' is an arbitrarily long number of revisions. In this case, you
will again find only "common" tips when you reach A and J, and it will look like
~ DMO and ENP are unique to each side. One of those is correct, one is wrong.
You don't *know* if you are correct or not until you have stepped along the
common nodes until they are ancestors of DEMNOP.

Otherwise N or P could be merged into any of F, G, H, etc.

So the problem is determining the intersection of ancestors for all unique
nodes. The best way I've found to do that is to start a BFS at each one, seed it
with the steps I've already taken, and then take the intersection. However,
there are still nodes that lie outside of that intersection, that may eventually
converge. So you still have to walk *all* of them.

The only thing I can think of is that you don't have to walk the actual
intersection nodes in more than one searcher. I do use a rule to kill off
searchers who have the same search tips. I should probably think a bit more
about this side of things, to see if I can't cull them when one of there tips is
already being searched elsewhere. I'm not sure if I could or not.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkgRd54ACgkQJdeBCYSNAAOS8gCgmS9qvr9qPjhOEq85zL47JDqQ
Wm0An3NdYN5BiLtuVL0eVSzb397qY1B/
=pVMk
-----END PGP SIGNATURE-----