[RFC] Non-memoisable BreadthFirstSearcher

Mon Apr 28 05:17:07 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:

...

>> ~        The parents of revisions will be returned from the next call to next()
>> ~        or next_with_ghosts(). If next_with_ghosts was the most recently used
>> ~        next* call then the return value is the result of looking up the
>> ~        ghost/not ghost status of revisions. (A tuple (present, ghosted)).
>> ~        """
>> ~        revisions = frozenset(revisions)
>> ~        self._started_keys.update(revisions)
>> ~        new_revisions = revisions.difference(self.seen)
>> ~        revs, ghosts, query, parents = self._do_query(revisions)
>> ~        self._stopped_keys.update(ghosts)
>>
>> ^--- _do_query is happening before the if check
>> ~        if self._returning == 'next':
>> ~            self._next_query.update(new_revisions)
>> ~        else:
>> ~            # perform a query on revisions
>> ~            self._current_present.update(revs)
>> ~            self._current_ghosts.update(ghosts)
>> ~            self._next_query.update(query)
>> ~            self._current_parents.update(parents)
>> ~            return revs, ghosts
> 
> The do_query there is because for post lookups we need to advance once
> to line up the results; for pre lookups we update the current state.
> 

Except in the "pre lookup" case you are still placing a call to
"get_parent_map()" and then ignoring the return value.

If this was your reasoning, then I think changing it to:
if self._returning == 'next':
  self._next_query.update(new_revisions)
  self.seen.update(new_revision)
...

is valid. Since 'self.seen' is the only state updated by _do_query, and
you didn't say that maintaining self._stopped_keys was important for
_returning == 'next'.

...

> Let me try to work through this. I'm going to assume I can use three
> searches to answer questions about two tip nodes. (Because there are
> precisely four colours in the full graph: unreachable, LEFT, COMMON and
> RIGHT.) Our job is to colour the graph.

I believed you worked through it correctly, the problem is that you seem
to have come to the same conclusion I did. Other than to say it is
ideally not the right solution.

> 
> Now as I understand the current bug, its that nodes which are common are
> reported as unique to one origin. And this occurs when we encounter
> nodes from one origin that are only reachable from another origin via
> commmon nodes because we stop common revision searches when the searches
> from origin have no uncoloured nodes.
> 
> Thats too early to stop: we need to keep searching common revisions
> until there is no possibility that a previously thought-unique revision
> gets coloured common.

Correct.

> 
> I think this is a good minimal test:
> 
> 
>     A 
>    /|\ 
>   / | B
>   | |/|
>   | C |
>   | | |
>   | D |
>   | | |
>   | E |
>   |/ \|
>   F   G
> 
> diff(F,G) should claim that left=F, right=G, common=the rest.
> 
> folk wanting to play with this may find the following python bits useful.
> small= {"A":"", "B":"A", "C":"AB","D":"C", "E":"D", "F":"AE", "G":"EB"}
> john3 = {"A":"","B":"A","D":"B", "E":"B", "M":"A", "F":"DE","N":"A", "O":"MD", "G":"F", "P":"EN", "H":"OG", "I":"H", "J":"H", "K":"OJ", "L":"JP"}
> to do a diff:
> Graph(DictParentsProvider(small)).find_difference("F", "G")
> 
> Now in the small graph, here is the current behaviour (each search is
> listed as 'seen', 'searching')
> Step | state
> 0      f = '', 'f'
>        g = '', 'g'
>        common = '', ''
> 1      f = 'f', 'ae'
>        g = 'g', 'eb'
>        common = '', ''
> 2      f = 'fae', ''
>        g = 'geb', 'a'
>        common = 'e', 'd'
> 3      f = 'fae', ''
>        g = 'gaeb', ''
>        common = 'd', 'c'
> and here we stop.
> 
> So when should we stop searching? We can't stop based on number of steps
> - any heuristic about number of steps can be defeated by adding to the
> CDE chain in the small example.
> 
> We can stop searching when there is no way for B to change colour from g
> to common.
> IF we had graph height we could say 'there is no way B can change colour
> when all common/f searches are equal or less than B's height'.

Correct. You and Aaron have discussed a "Greatest Distance From Origin"
sort of field here.

> Without height, all I have thought of so far is to say 'B can no longer
> change colour when all common/f searches have satisfied any of:
>  a - reached B (they were below B)
>  b - reached the edge (they were above B)
>  c - been reached by B (B was above them)
> '
> 
> This is I think what you are doing. Specifically, you are creating a new
> search from each B to answer (c) (but are not specifically covering a)
> yet).

(a) is covered, in as much as when you reach a node that is common, it
asks the searcher for something like:

  nodes = searcher.find_seen_ancestors(common_nodes)
  searcher.stop_searching_any(nodes)

  common_searcher.start_searching(nodes)

> 
> Condition b) above is easy to test for - its the current termination
> test. Condition a) we don't currently use in determining the ability to
> terminate the search. Condition c) is not examined either.
> 
> 
> Performance wise though, we only want to query any one revision once if
> possible, and more importantly we want to query as many revisions at
> once as possible (to maximum parallelism).
> 
> Creating new searches is antagonistic to both those performance goals.
> 
> In terms of structure, it would be ideal if we would make one and only
> one get_parent_map call on each loop, and use the results to update as
> many searches as are needed.

Well, in my early work I added a _BFS.step(parent_map) call, and a
'will_search() => nodes' calls so that I could batch up the requests.
When I was testing it on heads() it seemed to make things *worse*. I
think now it might have been that I was being too eager at stepping
forward, which was causing the queries to search more nodes than they
had to.

There are some other bits that might interest you. Consider your graph
above, with a small change:

     A
    /|\
   | | B'
   | | |
   | | B
   | |/|
   | C |
   | | |
   | D |
   | | |
   | E |
   |/ \|
   F   G

In this particular case, both B and B' are in the pseudo-unique set.
(Assuming CDE is arbitrarily long). However, you know that any ancestors
of B are going to be a superset of B', so you don't have to worry about
B. I use a quick traversal over the parent map of all pseudo-unique
nodes to throw out any that have parents in the set.
(_remove_simple_descendants()).

Ancestors which are common to all pseudo-unique nodes can be searched by
an independent searcher, rather than being searched N times. This also
helps cause searchers to run out of nodes, and thus be culled. I think
the total number of steps is the same, but there is certainly less work
being done.

This is not in the patch I sent, as I did it after our last email
exchange. I haven't submitted it yet, because in adding it, I introduced
the race condition. Which I have fixed, but I wanted to actually have a
test case for the condition, to prevent regressions. (It is a graph-race
condition, not a timing related one, so it should be reasonable to do,
just hard for me to find the right graph to trigger it so far.)

> 
> To stop searching to the top of history I think we have to implement
> tests for both a) and c).
> 
> I'm going to stop the written diarrhoea now and get some more
> VersionedFiles stuff done :P.
> 
> -Rob
>

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFIFU/DJdeBCYSNAAMRAtUtAKCJhNP+7U6yG1r6xmPrb7VjOa10JACUCAcu
85YK4/nGZ6Ndgcr3OmaY2Q==
=Ui2L
-----END PGP SIGNATURE-----