[RFC] Non-memoisable BreadthFirstSearcher

Robert Collins robertc at robertcollins.net
Mon Apr 28 04:21:27 BST 2008


On Fri, 2008-04-25 at 01:18 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> | On Thu, 2008-04-24 at 21:39 -0500, John Arbash Meinel wrote:
> |> -----BEGIN PGP SIGNED MESSAGE-----
> |> Hash: SHA1
> |>
> |> I've been working on the find_differences/find_unique_ancestors code,
> |> and I'm finding that quite a few of the changes to _BreadthFirstSearcher
> |> made it slower.
> |>
> |> For example, in "start_searching()" there is a call to _do_query(),
> |> because it wants to know if there are any ghosts.
> |>
> |> However, if I remove that _do_query() call, the average search time
> |> drops from 0.097s => 0.063s.
> |> Or almost 30% of the time is spent in that _do_query().
> |>
> |> Some of that might be how I've structured my code. Specifically, I call
> |> start_searching() with the assumption that any nodes that have
> |> already been searched will be trivially discarded. (I did that, because
> |> otherwise I have to add another parent graph search to find what nodes
> |> are trivial to discard.)
> |>
> |> Anyway, I can't say that I fully understand Robert's changes, to make
> |> sure that I don't break anything that he was trying to do.

you don't need to start searching nodes that have been returned, because
they have been found already. Why are you calling that?

> |> So my feeling was to write a new searcher, which would just be the bare
> |> minimum. I would also even consider having some aggregate search
> |> objects, because there is a bit of overhead in walking the same graph
> |> nodes multiple times.
> |>
> |> Part of the problem is that I'll have up to 30 or so searchers going.
> |> And that is because the only way to know that you've checked enough
> |> common revisions is when all common search revisions are ancestors of
> |> *all* non-common revisions. So you have to search a large set of the
> |> unique revisions (the ones that aren't trivially ancestors of others).
> |> [This is better for find_unique_ancestors, but it really hurts
> |> find_differences() on the bzr.dev code base].
> |>
> |> Then again, if everything is fully tested that Robert did, after my
> |> changes all the tests still pass.
> |>
> |> Thoughts?
> |> John
> |
> | I don't think we need two searchers; the searches should perform
> | identical query counts if used as they were before my changes. There are
> | two lookup strategies - pre-result and post-result. It sounds like
> | you're using post-result.
> |
> | Also, I don't understand why you'd ever need more than 3 searchers.


(Append 'when searching from 2 start points').

> | -Rob
> 
> Well, your specific changes infiltrated the pre-result section. Specifically you
> had:
> 
> ~    def start_searching(self, revisions):
> ~        """Add revisions to the search.
> 
> ~        The parents of revisions will be returned from the next call to next()
> ~        or next_with_ghosts(). If next_with_ghosts was the most recently used
> ~        next* call then the return value is the result of looking up the
> ~        ghost/not ghost status of revisions. (A tuple (present, ghosted)).
> ~        """
> ~        revisions = frozenset(revisions)
> ~        self._started_keys.update(revisions)
> ~        new_revisions = revisions.difference(self.seen)
> ~        revs, ghosts, query, parents = self._do_query(revisions)
> ~        self._stopped_keys.update(ghosts)
> 
> ^--- _do_query is happening before the if check
> ~        if self._returning == 'next':
> ~            self._next_query.update(new_revisions)
> ~        else:
> ~            # perform a query on revisions
> ~            self._current_present.update(revs)
> ~            self._current_ghosts.update(ghosts)
> ~            self._next_query.update(query)
> ~            self._current_parents.update(parents)
> ~            return revs, ghosts

The do_query there is because for post lookups we need to advance once
to line up the results; for pre lookups we update the current state.
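To make that concrete, here is a toy model (my notation, not the bzrlib implementation) of why the post-result style has to query eagerly: next_with_ghosts() cannot report the ghost status of the revisions it is about to return without having already done the parent lookup for them. start_searching() in post-result mode faces the same constraint, hence the eager _do_query there.

```python
class ToySearcher:
    """Toy post-result searcher (illustrative only, not bzrlib code)."""

    def __init__(self, parents, starts):
        self._parents = parents          # dict: rev -> iterable of parent revs
        self._frontier = set(starts)
        self.seen = set()

    def _do_query(self, revs):
        # One batched parent lookup: split revs into present/ghosts and
        # gather their unseen parents as the next frontier.
        present = set(r for r in revs if r in self._parents)
        ghosts = set(revs) - present
        nxt = set()
        for r in present:
            nxt.update(self._parents[r])
        return present, ghosts, nxt - self.seen

    def next_with_ghosts(self):
        # Post-result style: we must query *now* to know which of the
        # current revisions are ghosts, then advance the frontier.
        revs = self._frontier
        self.seen.update(revs)
        present, ghosts, self._frontier = self._do_query(revs)
        return present, ghosts
```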


> If I change it to:
> 
> if self._returning == 'next':
> ~  self._next_query.update(new_revisions)
> ~  self.seen.update(new_revisions)
> else:
> ~  revs, ghosts, ... = self._do_query(revisions)
> ~  self._stopped_keys.update(ghosts)
> ~  ...
> 
> 
> Then I see a very large performance improvement (90ms => 60ms as mentioned before).
> 
> The reason I need more than 3 searchers is because for each unique node, you
> only *know* that it is unique when all common search tips are ancestors of the
> node. Which holds true for *every* unique node. Otherwise history shortcuts will
> make it look like some nodes are unique when they are not.

Uhm. This isn't correct AFAICT. Think of it as a colouring problem.

> The example I use in the code is something like:
> 
> ~  A
> ~  |
> ~  B
> ~ / \
> D   E
> |\ /|
> | F |
> | | |
> | G |
> | | |
> | H |
> | | |
> | I |
> | | |
> | J |
> |/ \|
> K   L
> 
> If you start a searcher on K and L, you will see:
> 
> step 1: K-1  L-2
> step 2: J-1&2 D-1 E-2
> step 3: I-1&2 B-1&2 # All search tips are "common"
> 
> If you stop there, you would consider 'D' and 'E' to be unique to each side,
> which is not true. They are actually common nodes, but you have to step through
> the rest of F-J to find that out.
> 
> So the next step, I find the "trivial" ancestors for the unique nodes. In this
> case 'D' and 'E' are direct ancestors of other unique nodes, so are more
> interesting.
> 
> I then start new searchers at 'D' and 'E', and continue stepping. Because 'B' is
> a known ancestor of *both* D and E, I can stop searching there, and I will
> continue walking (D,E => B, A, etc) and J until it would be a descendant of both
> D and E.
>
> If you want an example with more than 2 searchers, how about:
> 
>     A
>    /|\
>   / B \
>  / / \ \
> | D   E |
> | |\ /| |
> M | F | N
>  \| | |/
>   O G P
>   |\| |
>   | H |
>   | | |
>   | I |
>   | | |
>   | J |
>   |/ \|
>   K   L
> 
> Assume that the '|' chains can stand for an arbitrarily long run of
> revisions. In this case, you will again find only "common" tips when you
> reach A and J, and it will look like DMO and ENP are unique to each side.
> One of those is correct, one is wrong. You don't *know* whether you are
> correct until you have stepped along the common nodes until they are
> ancestors of DEMNOP.
> 
> Otherwise N or P could be merged into any of F, G, H, etc.
>
> So the problem is determining the intersection of ancestors for all unique
> nodes. The best way I've found to do that is to start a BFS at each one, seed it
> with the steps I've already taken, and then take the intersection. However,
> there are still nodes that lie outside of that intersection, that may eventually
> converge. So you still have to walk *all* of them.
> 
> The only thing I can think of is that you don't have to walk the actual
> intersection nodes in more than one searcher. I do use a rule to kill off
> searchers who have the same search tips. I should probably think a bit more
> about this side of things, to see if I can't cull them when one of their tips is
> already being searched elsewhere. I'm not sure if I could or not.

Let me try to work through this. I'm going to assume I can use three
searches to answer questions about two tip nodes. (Because there are
precisely four colours in the full graph: unreachable, LEFT, COMMON and
RIGHT.) Our job is to colour the graph.

Now, as I understand the current bug, it's that nodes which are common get
reported as unique to one origin. This occurs when nodes from one origin
are reachable from the other origin only via common nodes, because we stop
the common revision searches as soon as the searches from each origin have
no uncoloured nodes.

That's too early to stop: we need to keep searching common revisions
until there is no possibility that a previously thought-unique revision
gets coloured common.

I think this is a good minimal test:


    A 
   /|\ 
  / | B
  | |/|
  | C |
  | | |
  | D |
  | | |
  | E |
  |/ \|
  F   G

diff(F,G) should claim that left=F, right=G, common=the rest.

folk wanting to play with this may find the following Python bits useful
(Graph and DictParentsProvider both live in bzrlib.graph):

small = {"A": "", "B": "A", "C": "AB", "D": "C", "E": "D", "F": "AE", "G": "EB"}
john3 = {"A": "", "B": "A", "D": "B", "E": "B", "M": "A", "F": "DE", "N": "A",
         "O": "MD", "G": "F", "P": "EN", "H": "OG", "I": "H", "J": "H",
         "K": "OJ", "L": "JP"}

to do a diff:

Graph(DictParentsProvider(small)).find_difference("F", "G")
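If you don't have bzrlib handy, a brute-force pure-Python stand-in pins down the expected answer. This is mine, not bzr's incremental algorithm - it walks complete ancestries, which is exactly what the searchers try to avoid, and it also returns the common set (bzr's find_difference returns only the two unique sets):

```python
def ancestry(parents, tip):
    # Full transitive ancestry of tip, including tip itself.
    seen = set()
    pending = [tip]
    while pending:
        rev = pending.pop()
        if rev not in seen:
            seen.add(rev)
            pending.extend(parents[rev])
    return seen

def find_difference(parents, left, right):
    # Brute-force colouring: (left-only, right-only, common).
    la = ancestry(parents, left)
    ra = ancestry(parents, right)
    return la - ra, ra - la, la & ra

small = {"A": "", "B": "A", "C": "AB", "D": "C", "E": "D", "F": "AE", "G": "EB"}
```

On the small graph this gives left={F}, right={G}, common=everything else, which is the answer the minimal test above demands.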

Now in the small graph, here is the current behaviour (each search is
listed as 'seen', 'searching')
Step | state
0      f = '', 'F'
       g = '', 'G'
       common = '', ''
1      f = 'F', 'AE'
       g = 'G', 'EB'
       common = '', ''
2      f = 'FAE', ''
       g = 'GEB', 'A'
       common = 'E', 'D'
3      f = 'FAE', ''
       g = 'GAEB', ''
       common = 'D', 'C'
and here we stop.

So when should we stop searching? We can't stop based on number of steps
- any heuristic about number of steps can be defeated by adding to the
CDE chain in the small example.

We can stop searching when there is no way for B to change colour from g
to common.
If we had graph height we could say 'there is no way B can change colour
once all common/f searches are at heights equal to or less than B's'.
Without height, all I have thought of so far is to say 'B can no longer
change colour once every common/f search has satisfied one of:
 a - reached B (it was below B)
 b - reached the edge (it was above B)
 c - been reached by B (B was above it)
'
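To make the height variant concrete, here is a sketch. Heights are not something bzr's graph actually stores; this computes them by walking the whole graph, so it is illustration only:

```python
def make_height(parents):
    # Height = length of the longest path down to a root (a revision
    # with no parents). Roots get height 1.
    cache = {}
    def height(rev):
        if rev not in cache:
            cache[rev] = 1 + max((height(p) for p in parents[rev]), default=0)
        return cache[rev]
    return height

small = {"A": "", "B": "A", "C": "AB", "D": "C", "E": "D", "F": "AE", "G": "EB"}
height = make_height(small)
# B sits at height 2; once every remaining common/f search tip is at
# height <= 2, nothing above can recolour B and the search may stop.
```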

This is, I think, what you are doing: creating a new search from each B
to answer (c) (though you are not specifically covering (a) yet).

Condition (b) is easy to test for - it's the current termination test.
Condition (a) is not currently used in deciding whether the search can
terminate, and condition (c) is not examined either.


Performance-wise, though, we only want to query any one revision once if
possible, and more importantly we want to query as many revisions at
once as possible (to maximise parallelism).

Creating new searches is antagonistic to both those performance goals.

In terms of structure, it would be ideal if we made one and only one
get_parent_map call on each loop, and used the results to update as
many searches as needed.
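A sketch of that structure (hypothetical helper, not existing bzrlib code): collect every searcher's frontier, issue one batched parent lookup, and fan the results back out. Per-searcher seen-tracking is omitted to keep it short.

```python
def step_all(parents, frontiers):
    """Advance several BFS frontiers with a single batched parent query.

    parents: dict mapping rev -> iterable of parent revs (ghosts absent).
    frontiers: list of sets, one per active search.
    """
    wanted = set().union(*frontiers) if frontiers else set()
    # The one get_parent_map-style call per loop iteration:
    parent_map = {r: parents[r] for r in wanted if r in parents}
    new_frontiers = []
    for frontier in frontiers:
        nxt = set()
        for rev in frontier:
            nxt.update(parent_map.get(rev, ()))
        new_frontiers.append(nxt)
    return new_frontiers
```

Two searchers started on F and G in the small graph would both want E's parents on the same loop, but E only gets looked up once per batch.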

To stop searching to the top of history I think we have to implement
tests for both a) and c).

I'm going to stop the written diarrhoea now and get some more
VersionedFiles stuff done :P.

-Rob





-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.