[MERGE] Batch get_parent_map HPSS calls made from InterRepo._walk_to_common_revisions

Andrew Bennetts andrew at canonical.com
Fri Oct 3 00:33:09 BST 2008


John Arbash Meinel wrote:
[...]
> 
> > So maybe we should be returning some data for HPSS's get_parent_map if
> > the answer would be empty.  As a hypothetical idea, if the remote repo
> > could efficiently generate a list of heads, then returning those (or
> > subset of them up to some fixed limit?) would be useful.
> > _walk_to_common_revisions could in principle be implemented as:
> 
> I'll note that Robert did work a while ago on searching local and remote
> graphs to find areas of overlap. I don't know where that ended, but it
> was certainly something we discussed when I was in Sydney last (2 yrs ago?)

Yes, we're finally getting towards the point where we might start using
those ideas :)

[...]
> I think we could simply improve the _walk logic, possibly only under
> HPSS conditions, but perhaps for all cases.
> 
> I believe the logic he outlined back then was about logarithmic scaling.
> So you would query for tip, tip-1, tip-2, tip-4, tip-8, tip-16, tip-32,
> etc. You could certainly cap that at 50 revisions which would be 2^50
> and still only request 50 revision ids and always either know that you
> have 0 overlap or the server could find something to give back.
> 
> And if it hit between tip-16 and tip-32, it could fan forward inbetween
> there.

Hmm, that's interesting.  The fan forward for HPSS could simply be a
straight get_parent_map of everything between tip-16 and tip-32; not
completely optimal, but easy.

Ideally with HPSS if you sent a request for a sequence like that, say
[tip, tip-1, tip-2, tip-4, tip-8, ...], then you'd only need a reply
that says what the first “hit” was.  i.e. if the 3rd graph key specified
was found, then the server can stop looking for the other 47 and just
return immediately.  So perhaps a new RPC would be good (eventually),
ideally something that encodes what the branching points are (e.g. if
there a multiple parents at tip-3 then there will be two tip-4 revisions
to look for).  With that RPC, the fan forward would be a follow up call
to that RPC that just queried every individual revision between the last
non-hit and the first hit.

The flip side to walking back 2^50 revisions is that the client will end
up reading basically the entire source revision graph!  Well, if there's
a lot of branching in the graph and the total number of revisions in a
request is still capped to 50 then perhaps not, but that still seems a
bit extreme.

> +            next_revs = set()
> +            while len(next_revs) <
> self._walk_to_common_revisions_batch_size:
> +                try:
> +                    next_revs_part, ghosts = searcher.next_with_ghosts()
> +                    next_revs.update(next_revs_part)
> +                except StopIteration:
> +                    run_out_of_next_revs = True
> +                    break
> +                if revision_ids.intersection(ghosts):
> +                    absent_ids = set(revision_ids.intersection(ghosts))
> +                    # If all absent_ids are present in target, no error is
> +                    # needed.
> +                    absent_ids.difference_update(
> +                        set(target_graph.get_parent_map(absent_ids)))
> +                    if absent_ids:
> +                        raise errors.NoSuchRevision(
> +                            self.source, absent_ids.pop())
> 
> 
> ^- The "if revision_ids.intersection(ghosts)" looks more like something
> that should be outside the while loop, rather than inside. But maybe I'm
> wrong.

Hmm.  You're right, the ghosts check can be batched up too without
changing the result.  Done.

...But, thinking about it more carefully, either way this patch may
behave differently when there are ghosts.

Consider the case where the remote side (target) has this graph:

A
|\
| \
B (ghost of C)

And the local side (source) has this graph:

N
|
|
A
|\
| \
B (ghost of C)

bzr.dev's code will first look for N, then it will look for A, and then
it will stop.

My patch will look for all of N, A, B and C on the first iteration, and
fail because C is not present in the source nor in the target.  I think
this is a genuine problem in my code, although the test suite doesn't
catch it.

-Andrew.




More information about the bazaar mailing list