fetch->streaming fetch
Robert Collins
robertc at robertcollins.net
Thu Feb 5 22:52:55 GMT 2009
Andrew and I got together yesterday for an impromptu sprint on the
network streaming fetch.
I'm mainly sending this as a TODO for the resulting branch, but I
figured it's of general interest if I also include a few notes of history
- and more eyes are usually good.
So we have infrastructure in (or nearly in) for streaming hpss calls,
but no call that really does a streaming push/pull. Such a call needs
both a serialisation for the data coming out of VersionedFiles, and a
bunch of control logic related to whether we're adding rich roots,
writing to a transactional or non-transactional backend, and so on.
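To make that concrete, here's a rough sketch of the kind of data that
has to cross the wire, assuming the repository exposes its stores as
VersionedFiles objects (texts, inventories, revisions, signatures) and
using the existing get_record_stream/insert_record_stream API; the
helper names and the (kind, stream) pairing are illustrative only, not
the final design.

    def extract_substreams(source_repo, keys_by_kind):
        # Yield (kind, record_stream) pairs for the requested keys,
        # e.g. kind='inventories' maps to source_repo.inventories.
        for kind, keys in keys_by_kind.iteritems():
            versioned_files = getattr(source_repo, kind)
            yield kind, versioned_files.get_record_stream(
                keys, 'topological', False)

    def insert_substreams(target_repo, substreams):
        # Replay each substream into the matching store on the target.
        for kind, records in substreams:
            getattr(target_repo, kind).insert_record_stream(records)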
We don't want to end up with yet another fetch implementation, either.
So, we refactored fetch.py with the goal (partially realised) of having
three distinct components, two of which can be remoted - that is,
executing in a separate process (the smart server) - but which will
still perform well when we execute them in-process.
We came up with these three components based on discussion of the code
that was in fetch.py (a rough interface sketch follows the list):
- source: an object that can provide streams for the contents of
revisions, and also for specific item keys (e.g. it can provide
everything in revision X, or ('inventories', revY) if that specific
inventory bytestring is needed for some reason). The source is where
we ended up doing the analysis for rich root conversion - working out
which roots need adding - because the source repository has the
inventories we need to access.
- sink: an object that consumes streams from the source, and manages
any state needed around the insertion of the streams into a
repository. The sink is where we put the parse-and-insert logic for
upgrading revision and inventory representation. This made sense to
us, but it's entirely possible to put this on the source - or even
in a general filter which could sit on either the source or the sink,
so that we can put it wherever it works best case by case - e.g. if
the bzrlib the sink is running in doesn't support the source format,
we could convert at the source, or vice versa. We haven't added that
degree of flexibility, but it seems like a fairly clean/clear thing
for future work.
More importantly, the sink handles integrity checking for stacked
repositories: when doing a push to a smart-server stacked branch,
the server can't see the stacked repository, and the client doesn't
want to do a massive lookup before pushing, so we figured the server
should report which basis texts are needed to ensure that any text
uploaded has its full delta chain available. So a sink is a single
object that is generally called into three times: once for the main
transfer of the stream, once to pick up any needed-basis bytestreams,
and once to signal that the operation is finished (which is a good
place to do the autopack check in the smart server, too).
- driver/coordinator/fetcher: responsible for the overall operation,
this code performs the setup of the source and sink objects, holds
the progress bar, and does pretty much everything not directly
related to extracting data from the source or inserting it into the
target.
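To make the split above more concrete, here is a rough sketch of the
three components. The class and method names are illustrative, not a
final API, and the missing-keys round trip is the stacked-repository
case described under the sink.

    class StreamSource(object):
        """Extracts data from the source repository."""

        def __init__(self, from_repository):
            self.from_repository = from_repository

        def get_stream(self, search):
            """Yield (kind, record_stream) pairs covering the revisions
            selected by 'search', plus any roots that have to be
            synthesised for a rich-root conversion."""

        def get_stream_for_missing_keys(self, keys):
            """Yield streams for specific item keys, e.g.
            ('inventories', revY), that the sink asked for."""

    class StreamSink(object):
        """Inserts data into the target repository."""

        def __init__(self, target_repository):
            self.target_repository = target_repository

        def insert_stream(self, stream, src_format):
            """Insert the main stream; return any basis keys the target
            still needs so that every delta chain is complete."""

        def finished(self):
            """Signal end of operation - the natural point for the smart
            server to run its autopack check."""

    def fetch(source, sink, search, src_format):
        # Driver: wires source to sink, owns the progress bar (elided
        # here), and handles the needed-basis round trip.
        missing = sink.insert_stream(source.get_stream(search), src_format)
        if missing:
            sink.insert_stream(
                source.get_stream_for_missing_keys(missing), src_format)
        sink.finished()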
Now, we got a nice-looking sink in place, but haven't made a really
clear source; this probably doesn't matter right now, but to get
streaming push *and* pull we'll need both sides clearly demarcated.
We have some must-do TODOs remaining before we can consider a merge
request.
Firstly, we need a consistent way to provide the sink with the
repository format object to use when it parses a revision/inventory -
this is for the conversion filtering logic. We started by just passing
in the repository format object, but that clearly isn't sufficient when
we start talking about a wire protocol (to move the sink to the smart
server). We considered the repository format string, but many
repositories are only versioned on their control dir (e.g. git, svn, hg,
pre-metabzrdir). So we came up with a couple of options: either add a
method to RepositoryFormat to get a 'network name' for the format, and
make sure there is a registry that looks up formats by that name, or do
the same thing for the serializer objects.
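For the first option, a minimal sketch of what that could look like,
assuming a new RepositoryFormat.network_name() method and a matching
registry (neither of which exists yet; the names are placeholders):

    from bzrlib import registry

    # Maps network names to format objects. Formats with a real on-disk
    # marker can reuse their format string as the name; foreign or
    # pre-metadir formats register an arbitrary unique name instead.
    network_format_registry = registry.Registry()

    class RepositoryFormat(object):

        def network_name(self):
            """Return a byte string identifying this format on the wire."""
            raise NotImplementedError(self.network_name)

    # The sink end would then recover the format (and its serializer)
    # with something like:
    #   format = network_format_registry.get(name_from_wire)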
Secondly, we need to do a bunch of acceptance level testing and make
sure the progress bars related to this are working cleanly, because the
layering change to streams has almost certainly added some confusion :).
That's all I remember; hope this has been useful for folks interested
in this part of the code base.
-Rob