[MERGE] Batch get_parent_map HPSS calls made from InterRepo._walk_to_common_revisions
Andrew Bennetts
andrew at canonical.com
Wed Oct 1 12:49:20 BST 2008
Currently, InterRepo._walk_to_common_revisions calls get_parent_map with
as few as one revision ID per iteration. This behaviour is fairly easy
to encounter with typical branches: if a branch being pushed has a
simple linear history of say 20 new revisions and no merges, then it
will make 20 get_parent_map calls. (If there are merges then the
breadth-first searcher will query more at a time, but still possibly
quite a low number.)
This isn't a big deal with bzr.dev as it is today, because
get_parent_map here will never be called on a RemoteRepository, only
ever on the concrete underlying (i.e. VFS) repository, and the VFS code
for pack-format get_parent_map tends to fetch a large fraction of the
remote index files via readv while bisecting for a key, and it caches
that data to speed later lookups.
With the (already approved, but not yet landed) patch to make
InterOtherToRemoteRepository actually use
RemoteRepository.get_parent_map, the situation changes. An HPSS
get_parent_map call for a single absent revision can do nothing but
return an empty result. So _walk_to_common_revisions on a linear
history of 20 new revisions will need 20 round-trips to find a common
revision. Longer histories will be worse.
So here's a patch that causes InterOtherToRemoteRepository's
_walk_to_common_revisions to batch up revision IDs for get_parent_map
calls until there's at least 50 revisions (or the source graph has run
out of revisions!). 50 is a number I've pulled out of thin air; it
seemed like an ok compromise between reducing round-trips vs. causing
the server (and client) to do extra work. I intend to do some
measurements with very large branches to see if there's a better value.
This patch doesn't affect behaviour for non-RemoteRepository
_walk_to_common_revisions. It would be wasteful to cause more index
bisection to happen in those cases.
The funny thing about the HPSS get_parent_map vs. VFS-based pack
get_parent_map is that for HPSS, querying for absent revisions doesn't
return any data at all to the client, but the VFS one will have
retrieved a significant fraction of the remote indices in the course of
bisecting for a negative answer. So while get_parent_map for HPSS is
clearly more optimal than via VFS in the case where the revision exists
(round-trips don't increase as the size of the remote repo increases,
let alone as the number of pack files increases), it's actually at a
disadvantage for negative results. And when pushing, negative results
during _walk_to_common_revisions are the common case.
So maybe we should be returning some data for HPSS's get_parent_map if
the answer would be empty. As a hypothetical idea, if the remote repo
could efficiently generate a list of heads, then returning those (or
subset of them up to some fixed limit?) would be useful.
_walk_to_common_revisions could in principle be implemented as:
remote_heads = remote_graph.heads()
searcher = source_graph._make_breadth_first_searcher(source_tip)
for next_revs in searcher:
common_revisions = next_revs.intersection(remote_heads)
if common_revisions:
return common_revisions
(simplified to ignore ghosts)
But even returning some random revisions (say the ones that the remote
repo happened to open anyway in the course of doing the server-side
bisection) would help. We should do something; get_parent_map via HPSS
really shouldn't be at a disadvantage to a dumber method.
In the meantime, this patch helps.
-Andrew.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: batch-get_parent_map-3732.patch
Type: text/x-diff
Size: 5943 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20081001/b24fc91c/attachment.bin
More information about the bazaar
mailing list