[MERGE] Batch get_parent_map HPSS calls made from InterRepo._walk_to_common_revisions

John Arbash Meinel john at arbash-meinel.com
Wed Oct 8 20:56:53 BST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John Arbash Meinel wrote:
> [...]
>>> So maybe we should be returning some data for HPSS's get_parent_map if
>>> the answer would be empty.  As a hypothetical idea, if the remote repo
>>> could efficiently generate a list of heads, then returning those (or
>>> subset of them up to some fixed limit?) would be useful.
>>> _walk_to_common_revisions could in principle be implemented as:
>> I'll note that Robert did work a while ago on searching local and remote
>> graphs to find areas of overlap. I don't know where that ended, but it
>> was certainly something we discussed when I was in Sydney last (2 yrs ago?)
> 
> Yes, we're finally getting towards the point where we might start using
> those ideas :)
> 
> [...]
>> I think we could simply improve the _walk logic, possibly only under
>> HPSS conditions, but perhaps for all cases.
>>
>> I believe the logic he outlined back then was about logarithmic scaling.
>> So you would query for tip, tip-1, tip-2, tip-4, tip-8, tip-16, tip-32,
>> etc. You could certainly cap that at 50 revisions which would be 2^50
>> and still only request 50 revision ids and always either know that you
>> have 0 overlap or the server could find something to give back.
>>
>> And if it hit between tip-16 and tip-32, it could fan forward in
>> between there.
> 
> Hmm, that's interesting.  The fan forward for HPSS could simply be a
> straight get_parent_map of everything between tip-16 and tip-32; not
> completely optimal, but easy.
> 
> Ideally with HPSS if you sent a request for a sequence like that, say
> [tip, tip-1, tip-2, tip-4, tip-8, ...], then you'd only need a reply
> that says what the first “hit” was.  i.e. if the 3rd graph key specified
> was found, then the server can stop looking for the other 47 and just
> return immediately.  So perhaps a new RPC would be good (eventually),
> ideally something that encodes what the branching points are (e.g. if
> there are multiple parents at tip-3 then there will be two tip-4 revisions
> to look for).  With that RPC, the fan forward would be a follow up call
> to that RPC that just queried every individual revision between the last
> non-hit and the first hit.
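
The "first hit" reply could be sketched server-side roughly like this
(a toy illustration only; `first_hit`, `present`, and `candidates` are
hypothetical names, not an existing bzrlib API):

```python
def first_hit(present, candidates):
    """Scan the ordered probe keys and return (index, revision_id) for
    the first one the repository has, so the server can stop without
    looking up any of the remaining keys.  Returns None on no hit."""
    for index, revision_id in enumerate(candidates):
        if revision_id in present:
            return index, revision_id
    return None

# Probing [tip, tip-1, tip-2, tip-4, ...]: if the 3rd key is present,
# the server answers after 3 membership checks, not 50.
```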
> 
> The flip side to walking back 2^50 revisions is that the client will end
> up reading basically the entire source revision graph!  Well, if there's
> a lot of branching in the graph and the total number of revisions in a
> request is still capped to 50 then perhaps not, but that still seems a
> bit extreme.

Sure. But we don't have to go all the way back to 2^50. Also, we can
carry the current exponent over between requests. So you start by
sending 4 revisions: "2^0, 2^1, 2^2, 2^3". If that fails, you go to
"2^4, 2^5, 2^6, 2^7", etc.
(1, 2, 4, 8; 16, 32, 64, 128; etc.)

You still reach 128 revisions back after only 2 round trips. Also,
arguably we could just cap it at 2^10 = 1024 and request every 1,000th
node from there on back.
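
That batching could be sketched like so (a toy illustration; the batch
size and cap are the hypothetical parameters discussed above, not real
bzrlib API):

```python
def exponential_offset_batches(batch_size=4, cap=1024):
    """Yield batches of exponentially spaced offsets back from tip.

    Offsets grow as 2^0, 2^1, 2^2, ... up to `cap`, grouped so that
    each round trip only asks about `batch_size` revisions."""
    offsets = []
    exponent = 0
    while 2 ** exponent <= cap:
        offsets.append(2 ** exponent)
        if len(offsets) == batch_size:
            yield offsets
            offsets = []
        exponent += 1
    if offsets:
        yield offsets

# First round trip probes tip-1, tip-2, tip-4, tip-8; the second
# probes tip-16 through tip-128; and so on up to the cap.
```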


> 
>> +            next_revs = set()
>> +            while len(next_revs) < self._walk_to_common_revisions_batch_size:
>> +                try:
>> +                    next_revs_part, ghosts = searcher.next_with_ghosts()
>> +                    next_revs.update(next_revs_part)
>> +                except StopIteration:
>> +                    run_out_of_next_revs = True
>> +                    break
>> +                if revision_ids.intersection(ghosts):
>> +                    absent_ids = set(revision_ids.intersection(ghosts))
>> +                    # If all absent_ids are present in target, no error is
>> +                    # needed.
>> +                    absent_ids.difference_update(
>> +                        set(target_graph.get_parent_map(absent_ids)))
>> +                    if absent_ids:
>> +                        raise errors.NoSuchRevision(
>> +                            self.source, absent_ids.pop())
>>
>>
>> ^- The "if revision_ids.intersection(ghosts)" looks more like something
>> that should be outside the while loop, rather than inside. But maybe I'm
>> wrong.
> 
> Hmm.  You're right, the ghosts check can be batched up too without
> changing the result.  Done.
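
A rough sketch of that batched shape (`StubSearcher` stands in for
bzrlib's breadth-first searcher and `collect_batch` is an illustrative
name, not the actual patch): ghosts are accumulated inside the loop,
and the ghost check runs once afterwards.

```python
class StubSearcher:
    """Stand-in for the graph searcher: each call to next_with_ghosts()
    returns (revisions, ghosts) for one generation, then the underlying
    iterator raises StopIteration."""
    def __init__(self, generations):
        self._iter = iter(generations)

    def next_with_ghosts(self):
        return next(self._iter)

def collect_batch(searcher, batch_size):
    """Accumulate revisions until batch_size is reached (or the search
    runs out), gathering ghosts as we go for a single check later."""
    next_revs, ghosts = set(), set()
    run_out_of_next_revs = False
    while len(next_revs) < batch_size:
        try:
            revs_part, ghosts_part = searcher.next_with_ghosts()
        except StopIteration:
            run_out_of_next_revs = True
            break
        next_revs.update(revs_part)
        ghosts.update(ghosts_part)
    return next_revs, ghosts, run_out_of_next_revs
```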
> 
> ...But, thinking about it more carefully, either way this patch may
> behave differently when there are ghosts.
> 
> Consider the case where the remote side (target) has this graph:
> 
> A
> |\
> | \
> B (ghost of C)
> 
> And the local side (source) has this graph:
> 
> N
> |
> |
> A
> |\
> | \
> B (ghost of C)
> 
> bzr.dev's code will first look for N, then it will look for A, and then
> it will stop.
> 
> My patch will look for all of N, A, B and C on the first iteration, and
> fail because C is not present in the source nor in the target.  I think
> this is a genuine problem in my code, although the test suite doesn't
> catch it.
> 
> -Andrew.

I think you are misunderstanding 'revision_ids', though I need to look
at your patch more closely. It is meant as the *revisions supplied by
the caller*, and if any of them are missing, it will error. It is not
meant as "all revisions in the ancestry must be present".

I'm pretty sure your patch doesn't change this, because you *never
mutate* the supplied value.

So I think moving it outside the loop is fine, and I don't think the new
code is wrong in any way.
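
To make the ghost scenario concrete, here is a toy sketch using plain
dicts rather than real bzr graph objects; it only illustrates why
walking several generations per round trip reaches the ghost C, while
one-generation steps from N never need to ask about it:

```python
# A "ghost" is a revision named as a parent but absent from the
# parent map itself.  These mirror the graphs in the message above.
source = {'N': ('A',), 'A': ('B', 'C'), 'B': ()}   # C is a ghost
target = {'A': ('B', 'C'), 'B': ()}                # C is a ghost too

def step_generations(graph, tip, generations):
    """Walk back `generations` steps at once, collecting every
    revision reached plus any ghosts encountered along the way."""
    found, ghosts = set(), set()
    frontier = {tip}
    for _ in range(generations):
        next_frontier = set()
        for rev in frontier:
            if rev in graph:
                found.add(rev)
                next_frontier.update(graph[rev])
            else:
                ghosts.add(rev)
        frontier = next_frontier
    return found, ghosts

# Two generations (N, then A): the walk can stop at A, which the
# target has, without ever mentioning C.
found, ghosts = step_generations(source, 'N', 2)
assert 'C' not in ghosts

# Batching a third generation into the same request reaches C and
# flags it as a ghost -- the behaviour difference discussed above.
found, ghosts = step_generations(source, 'N', 3)
assert ghosts == {'C'}
```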

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjtEIUACgkQJdeBCYSNAAN4CwCfQvSZ+HZCzZwIXh0A9kuUuMrs
0eoAoJMBeM5knkZKbFqG7G9qbylxflTc
=Umo+
-----END PGP SIGNATURE-----


