[MERGE] Add optional 'fetch_spec' to argument to Repository.fetch

Thu Mar 5 14:37:52 GMT 2009

The current fetch API only accepts a last_revision argument to control what
revisions need fetching.  Mostly this works well, but when the target repository
was just created it's rather wasteful, as the fetch code queries the entire
history of the source in a vain effort to find some common history.  The larger
the branch we're sprouting/cloning, the more effort we waste.  This shows up in
HPSS conversations as a bunch of get_parent_map calls (or rix file reads),
roughly proportional to the size of history.

So this patch adds a new optional argument to fetch, “fetch_spec”, as an
alternative to last_revision.  A fetch_spec is a SearchResult (or similar, see
below) object, and if given the fetch code will simply fetch the revisions it
specifies rather than determining the revisions to fetch from scratch.

However, generating a valid SearchResult that can be serialised over HPSS still
requires iterating all of the history, because it's “recipe” includes the number
of keys found by the search as a cross-check.  So I've added a
“MiniSearchResult”, which is extremely simple-minded: it just takes a single
key, and is interpreted to mean the full ancestry of that key, all the way back
to NULL_REVISION.  The Repository.get_stream RPC has been updated to know how to
serialise and deserialise MiniSearchResults.

So now BzrDir's clone and sprout code, when it creates a new repository to put a
branch in, will instruct fetch to transfer the full ancestry of the branch, no
questions asked.

The end result is a reduction in round-trips over the network, especially when
branching from a server to a standalone branch: now the client finds the remote
branch's revision_id and then simply asks for the full ancestry of that revision
without first wasting time finding out what that history is.  I wouldn't be
surprised to find a small performance benefit for local standalone branching
too.

Be warned that this *changes* the protocol for the Repository.get_stream RPC
incompatibly.  The RPC has only been in bzr.dev a few days, but if you run
bzr.dev... be warned!

Some open questions:

 * “fetch_spec” is a poor name for the new parameter.  I'm open to alternative
   suggestions.  Robert suggested “search” on IRC, although I'm not sure if
   that's a strictly accurate label.  Does a graph “search” accurately describe
   this concept, or is the overlap just coincidental?

 * MiniSearchResult has a poor name too.  It may also be a bit *too* simple.
   It might be useful to have the flexibility of SearchResult's start_keys and
   exclude_keys sets while not requiring its key count.  Similarly the network
   serialisation perhaps should be more flexible.  

 * It also smells a little that I have to pass the repository to
   MiniSearchResult's constructor so that its get_keys() method can work;
   perhaps the description of what keys to get and the logic to generate keys
   from that description should be teased apart?

 * Like the current fetch code, this doesn't take care to fetch the revisions
   specified in the branch's tags, but it probably should.  That's an argument
   in favour of a richer fetch_spec than my MiniSearchResult, because that would
   need multiple start_keys.

Despite the questions, this is actually a pretty clean change, so don't be put
off too much :)

I've submitted this against my less-skips patch, because I needed the extra test
coverage while developing it.

-Andrew.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fetch-with-search-4082.patch
Type: text/x-diff
Size: 38773 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20090306/9d32307a/attachment-0001.bin