[MERGE] Batch get_parent_map HPSS calls made from InterRepo._walk_to_common_revisions

Andrew Bennetts andrew at canonical.com
Wed Oct 1 12:49:20 BST 2008


Currently, InterRepo._walk_to_common_revisions calls get_parent_map with
as few as one revision ID per iteration.  This behaviour is fairly easy
to encounter with typical branches: if a branch being pushed has a
simple linear history of say 20 new revisions and no merges, then it
will make 20 get_parent_map calls.  (If there are merges then the
breadth-first searcher will query more at a time, but still possibly
quite a low number.)

This isn't a big deal with bzr.dev as it is today, because
get_parent_map here will never be called on a RemoteRepository, only
ever on the concrete underlying (i.e. VFS) repository, and the VFS code
for pack-format get_parent_map tends to fetch a large fraction of the
remote index files via readv while bisecting for a key, and it caches
that data to speed later lookups.

With the (already approved, but not yet landed) patch to make
InterOtherToRemoteRepository actually use
RemoteRepository.get_parent_map, the situation changes.  An HPSS
get_parent_map call for a single absent revision can do nothing but
return an empty result.  So _walk_to_common_revisions on a linear
history of 20 new revisions will need 20 round-trips to find a common
revision.  Longer histories will be worse.

So here's a patch that causes InterOtherToRemoteRepository's
_walk_to_common_revisions to batch up revision IDs for get_parent_map
calls until there's at least 50 revisions (or the source graph has run
out of revisions!).  50 is a number I've pulled out of thin air; it
seemed like an ok compromise between reducing round-trips vs. causing
the server (and client) to do extra work.  I intend to do some
measurements with very large branches to see if there's a better value.

This patch doesn't affect behaviour for non-RemoteRepository
_walk_to_common_revisions.  It would be wasteful to cause more index
bisection to happen in those cases.

The funny thing about the HPSS get_parent_map vs. VFS-based pack
get_parent_map is that for HPSS, querying for absent revisions doesn't
return any data at all to the client, but the VFS one will have
retrieved a significant fraction of the remote indices in the course of
bisecting for a negative answer.  So while get_parent_map for HPSS is
clearly more optimal than via VFS in the case where the revision exists
(round-trips don't increase as the size of the remote repo increases,
let alone as the number of pack files increases), it's actually at a
disadvantage for negative results.  And when pushing, negative results
during _walk_to_common_revisions are the common case.

So maybe we should be returning some data for HPSS's get_parent_map if
the answer would be empty.  As a hypothetical idea, if the remote repo
could efficiently generate a list of heads, then returning those (or
subset of them up to some fixed limit?) would be useful.
_walk_to_common_revisions could in principle be implemented as:

    remote_heads = remote_graph.heads()
    searcher = source_graph._make_breadth_first_searcher(source_tip)
    for next_revs in searcher:
        common_revisions = next_revs.intersection(remote_heads)
        if common_revisions:
           return common_revisions
    
(simplified to ignore ghosts)

But even returning some random revisions (say the ones that the remote
repo happened to open anyway in the course of doing the server-side
bisection) would help.  We should do something; get_parent_map via HPSS
really shouldn't be at a disadvantage to a dumber method.

In the meantime, this patch helps.

-Andrew.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: batch-get_parent_map-3732.patch
Type: text/x-diff
Size: 5943 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20081001/b24fc91c/attachment.bin 


More information about the bazaar mailing list