fetch->streaming fetch

Robert Collins robertc at robertcollins.net
Thu Feb 5 22:52:55 GMT 2009


Andrew and I got together yesterday for an impromptu sprint on the
network streaming fetch.

I'm mainly sending this as a TODO for the resulting branch, but I
figured it's of general interest if I also include a few notes of history
- and more eyes are usually good.

So we have infrastructure in (or nearly in) for streaming hpss calls,
but no call that actually does a streaming push/pull. Such a call needs
both a serialisation for the data coming out of VersionedFiles, and a
bunch of control logic related to whether we're adding rich roots,
writing to a transactional or non-transactional backend, etc.
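
To make the serialisation half concrete, here is a minimal sketch of
length-prefixing record triples for the wire. Everything here is
invented for illustration - the function names and framing are not
bzrlib's actual bytestream format:

import struct

def serialise_records(records):
    # 'records' is an iterable of (kind, key, content) triples, e.g.
    # ('inventories', ('rev-y',), b'<inventory .../>').
    for kind, key, content in records:
        header = ('%s %s' % (kind, '\x00'.join(key))).encode('utf-8')
        for part in (header, content):
            # Each part is framed as a 4-byte big-endian length
            # followed by the bytes themselves.
            yield struct.pack('>I', len(part)) + part

def deserialise_records(data):
    # Inverse of serialise_records, over a single joined bytes blob.
    offset = 0
    while offset < len(data):
        parts = []
        for _ in range(2):
            (length,) = struct.unpack_from('>I', data, offset)
            offset += 4
            parts.append(data[offset:offset + length])
            offset += length
        header, content = parts
        kind, keystr = header.decode('utf-8').split(' ', 1)
        yield kind, tuple(keystr.split('\x00')), content

The essential property - b''.join(serialise_records(...)) round-trips
through deserialise_records - is what any real wire format needs too.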

We don't want to end up with yet-another-fetch-implementation, either.
So we refactored fetch.py with the goal (partially realised) of having
three distinct components, of which two can be remoted - that is,
executing in a separate process (the smart server) - but which will
still perform well when we are executing them in process.

We came up with these three components based on discussion of the code
that was in fetch.py (a rough interface sketch of all three follows the
list):
 - source: an object that can provide streams for the contents of
   revisions, and also for some specific item keys (e.g. it can
   provide everything in revision X, or ('inventories', revY) if that
   specific inventory bytestring is needed for some reason). The
   source is also where we ended up doing the analysis for rich root
   conversion - working out which roots need adding - because the
   source repository has the inventories we need to access.

 - sink: an object that consumes streams from the source, and manages
   any state needed around the insertion of the streams into a
   repository. The sink is where we put the parse-and-insert logic for
   upgrading revision and inventory representations. This made sense
   to us, but it's entirely possible to put this on the source - or
   even in a general filter that could sit on either the source or the
   sink, so that we can put it where it works best case by case - e.g.
   if the bzrlib the sink is running in doesn't support the source
   format, we could convert at the source, or vice versa. We haven't
   added that degree of flexibility, but it seems like a fairly
   clean/clear thing for future work.
   More importantly, the sink handles integrity checking for stacked
   repositories: when doing a push to a smart-server stacked branch,
   the server can't see the stacked repository, and the client doesn't
   want to do a massive lookup before pushing, so we figured the
   server should report what basis texts are needed to ensure that any
   text uploaded has its full delta chain available. So a sink is a
   single object that is generally called into three times: once for
   the main transfer of the stream, once to pick up any needed-basis
   bytestreams, and once to signal that the operation is finished
   (which is a good place to do the autopack check in the smart server
   too).

 - driver/coordinator/fetcher: responsible for the overall operation,
   this code performs the setup of the source and sink objects, holds
   the progress bar, and does pretty much everything not directly
   related to extracting data from the source or inserting it into the
   target.
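
To make the split concrete, here is a rough interface sketch of the
three components. The class and method names are hypothetical - this
is the shape we were aiming for, not final bzrlib API:

class StreamSource(object):
    """Provides record streams out of a source repository."""

    def get_stream(self, search):
        # Yield records for everything selected by 'search', e.g. all
        # the texts, inventories and revisions needed for revision X.
        raise NotImplementedError(self.get_stream)

    def get_stream_for_missing_keys(self, keys):
        # Yield records for specific keys only, e.g. the basis texts
        # the sink reported as missing.
        raise NotImplementedError(self.get_stream_for_missing_keys)


class StreamSink(object):
    """Consumes record streams, inserting them into a repository."""

    def insert_stream(self, stream, src_format):
        # Parse and insert 'stream', converting from src_format where
        # the target format differs. Returns the keys of any basis
        # texts still needed to complete delta chains (the stacked
        # repository case described above).
        raise NotImplementedError(self.insert_stream)

    def finished(self):
        # End of operation - a good hook for the autopack check.
        raise NotImplementedError(self.finished)


class Fetcher(object):
    """Coordinates the overall fetch; owns the progress bar."""

    def __init__(self, source, sink, pb=None):
        self.source = source
        self.sink = sink
        self.pb = pb

    def fetch(self, search, src_format):
        # Call one: the main transfer of the stream.
        missing = self.sink.insert_stream(
            self.source.get_stream(search), src_format)
        if missing:
            # Call two: pick up any needed-basis bytestreams.
            self.sink.insert_stream(
                self.source.get_stream_for_missing_keys(missing),
                src_format)
        # Call three: signal that the operation is finished.
        self.sink.finished()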

Now, we've got a nice-looking sink in place, but haven't made a really
clear source; this probably doesn't matter right now, but to get
streaming push *and* pull we'll need both sides clearly demarcated.

We had some must-do TODOs remaining before we can consider a merge
request.

Firstly, we need a consistent way to provide the sink with the
repository format object to use when it parses a revision/inventory -
this is for the conversion filtering logic. We started by just passing
in the repository format object, but that clearly isn't sufficient
once we start talking about a wire protocol (to move the sink to the
smart server). We considered the repository format string, but many
repositories are only versioned on their control dir (e.g. git, svn,
hg, pre-metabzrdir). So we came up with a couple of options: either
add a method on RepositoryFormat that returns a 'network name' for the
format, and make sure there is a registry that can look formats up by
that name, or do the same thing for the serializer objects.
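
Of the two, the format-level option might look something like this
minimal sketch. All the names here are invented for illustration
(RepositoryFormat stands in for bzrlib's real class); only the idea -
a wire-safe name plus a registry keyed on it - is what we discussed:

class RepositoryFormat(object):

    def network_name(self):
        # A byte string naming the *data* format on the wire,
        # independent of how the format is versioned on disk.
        raise NotImplementedError(self.network_name)


class ExamplePackFormat(RepositoryFormat):

    def network_name(self):
        return b'example pack repository v1\n'  # illustrative only


network_format_registry = {}


def register_network_format(format):
    network_format_registry[format.network_name()] = format


def lookup_network_format(name):
    # The sink calls this with the name received over the wire to pick
    # the right parser for incoming revision/inventory bytestrings.
    return network_format_registry[name]


register_network_format(ExamplePackFormat())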

Secondly, we need to do a bunch of acceptance-level testing and make
sure the progress bars related to this are working cleanly, because the
layering change to streams has almost certainly added some confusion :).

That's all I remember - hope this has been useful for folks interested
in this part of the code base.

-Rob