Rev 3351: Draft up an interface for repository streams that is more capable than the current one. in http://people.ubuntu.com/~robertc/baz2.0/versioned_files
Robert Collins
robertc at robertcollins.net
Fri Apr 11 02:07:00 BST 2008
At http://people.ubuntu.com/~robertc/baz2.0/versioned_files
------------------------------------------------------------
revno: 3351
revision-id: robertc at robertcollins.net-20080411010629-j07mncp10h10obg8
parent: pqm at pqm.ubuntu.com-20080409233555-n26cmi0y1hb98tf6
committer: Robert Collins <robertc at robertcollins.net>
branch nick: data_stream_revamp
timestamp: Fri 2008-04-11 11:06:29 +1000
message:
Draft up an interface for repository streams that is more capable than the
current one.
added:
doc/developers/repository-stream.txt repositorystream.txt-20080410222511-nh6b9bvscvcerh48-1
modified:
doc/developers/index.txt index.txt-20070508041241-qznziunkg0nffhiw-1
=== modified file 'doc/developers/index.txt'
--- a/doc/developers/index.txt 2008-04-08 07:46:55 +0000
+++ b/doc/developers/index.txt 2008-04-11 01:06:29 +0000
@@ -27,6 +27,9 @@
* `Container format <container-format.html>`_ |--| Notes on a container format
for streaming and storing Bazaar data.
+* `Repository stream <repository-stream.html>`_ |--| Notes on streaming data
+ for repositories (a layer above the container format).
+
* `Indices <indices.html>`_ |--| The index facilities available within bzrlib.
* `Inventories <inventory.html>`_ |--| Tree shape abstraction.
=== added file 'doc/developers/repository-stream.txt'
--- a/doc/developers/repository-stream.txt 1970-01-01 00:00:00 +0000
+++ b/doc/developers/repository-stream.txt 2008-04-11 01:06:29 +0000
@@ -0,0 +1,194 @@
+==================
+Repository Streams
+==================
+
+Status
+======
+
+:Date: 2008-04-11
+
+This document describes the proposed programming interface for streaming
+data from and into repositories. It should provide a single interface
+for pulling data from, and inserting data into, a Bazaar repository.
+
+.. contents::
+
+
+Motivation
+==========
+
+To eliminate the current situation where extracting data from a
+repository requires either using a slow format, or knowing the format of
+both the source repository and the target repository.
+
+
+Use Cases
+=========
+
+Here's a brief description of use cases this interface is intended to
+support.
+
+Fetch operations
+----------------
+
+We fetch data between repositories as part of push/pull/branch operations.
+Fetching data is currently a very interactive process with many
+requests. Supplying the data as a stream will improve push and pull
+performance to remote servers. For purely local operations the
+streaming logic should help reduce memory pressure. In fetch operations
+we always know the formats of both the source and target.
+
+Smart server operations
+~~~~~~~~~~~~~~~~~~~~~~~
+
+With the smart server we support one streaming format, but this is only
+usable when both the client and server have the same model of data, and
+requires non-optimal IO ordering for pack to pack operations. Ideally
+we can address both of these limitations.
+
+Bundles
+-------
+
+Bundles also create a stream of data for revisions from a repository.
+Unlike fetch operations we do not know the format of the target at the
+time the stream is created. It would be good to be able to treat bundles
+as frozen branches and repositories, so a serialised stream should be
+suitable for this.
+
+Data conversion
+---------------
+
+At this point we are not trying to integrate data conversion into this
+interface, though it is likely possible.
+
+
+Characteristics
+===============
+
+Some key aspects of the described interface are discussed in this section.
+
+Single round trip
+-----------------
+
+All users of this should be able to create an appropriate stream from a
+single round trip.
+
+Forward-only reads
+------------------
+
+There should be no need to seek in a stream when inserting data from it
+into a repository. This places an ordering constraint on streams which
+some repositories do not need.
+
+
+Serialisation
+=============
+
+At this point serialisation of a repository stream has not been specified.
+However, some considerations about serialisation are worth noting.
+
+Weaves
+------
+
+While there shouldn't be too many users of weave repositories anymore,
+avoiding pathological behaviour when a weave is being read is a good idea.
+Having the weave itself embedded in the stream is very straightforward
+and avoids expensive on-the-fly extraction and re-diffing.
+
+Bundles
+-------
+
+Being able to perform random reads from a repository stream which is a
+bundle would allow stacking a bundle and a real repository together. This
+will need the pack container format to be used in such a way that we can
+avoid reading more data than needed within the pack container's readv
+interface.
+
+
+Specification
+=============
+
+This describes the interface for requesting a stream, and the programming
+interface a stream must provide. Streams that have been serialised should
+expose the same interface.
+
+Requesting a stream
+-------------------
+
+To request a stream, three parameters are needed:
+
+ * A revision search to select the revisions to include.
+ * A data ordering flag. There are two values for this - 'unordered' and
+ 'topological'. 'unordered' streams are useful when inserting into
+ repositories that have the ability to perform atomic insertions.
+ 'topological' streams are useful when converting data, or when
+ inserting into repositories that cannot perform atomic insertions (such
+ as knit or weave based repositories).
+ * A complete_inventory flag. When provided this flag signals the stream
+ generator to include all the data needed to construct the inventory of
+ each revision included in the stream, rather than just deltas. This is
+ useful when converting data from a repository with a different
+ inventory serialisation, as pure deltas would not be able to be
+ reconstructed.
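The three request parameters above can be sketched as follows. This is a minimal illustration, not bzrlib code: the function name and the string values of the ordering flags are assumptions made for the example.

```python
# Hypothetical sketch of validating the three stream-request
# parameters described above.  Names are illustrative only.

UNORDERED = 'unordered'
TOPOLOGICAL = 'topological'

def request_stream_params(search, order, complete_inventory):
    """Return the validated (search, order, complete_inventory) triple.

    search: an object selecting the revisions to include.
    order: UNORDERED for targets with atomic insertion, TOPOLOGICAL
        for conversions or non-atomic targets (knits, weaves).
    complete_inventory: True to include full inventory data rather
        than deltas (needed across inventory serialisations).
    """
    if order not in (UNORDERED, TOPOLOGICAL):
        raise ValueError('unknown data ordering: %r' % (order,))
    return (search, order, bool(complete_inventory))
```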
+
+
+Structure of a stream
+---------------------
+
+A stream is an object. It can be consistency checked via the ``check``
+method (which consumes the stream). The ``iter_contents`` method can be
+used to iterate the contents of the stream. The contents of the stream are
+a series of top level records, each of which contains one or more
+bytestrings (potentially as a delta against another item in the
+repository) and some optional metadata.
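The object shape described above can be sketched like this. The class names and the particular consistency rules in ``check`` are assumptions for illustration; only the attribute and method names (``check``, ``iter_contents``, ``key_prefix``, ``entries``) come from this document.

```python
# Sketch (not bzrlib code) of the stream object shape: top level
# records, each with a key_prefix tuple and an entries iterator of
# (metadata, bytes) pairs.

class StreamRecord:
    def __init__(self, key_prefix, entries):
        self.key_prefix = key_prefix    # tuple prefix for item keys
        self.entries = iter(entries)    # yields (metadata, bytes)

class RepositoryStream:
    def __init__(self, records):
        self._records = iter(records)

    def iter_contents(self):
        """Yield the top level records; this consumes the stream."""
        return self._records

    def check(self):
        """Consistency-check the stream (consuming it)."""
        for record in self.iter_contents():
            for metadata, data in record.entries:
                # The checked keys here are illustrative only.
                if 'storage_kind' not in metadata or 'key' not in metadata:
                    raise ValueError('malformed record entry')
```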
+
+
+Consuming a stream
+------------------
+
+To consume a stream, obtain an iterator from the stream's
+``iter_contents`` method. This iterator will yield the top level records.
+Each record has two attributes. One is ``key_prefix`` which is a tuple key
+prefix for the names of each of the bytestrings in the record. The other
+attribute is ``entries``, an iterator of the individual items in the
+record. Each item that the iterator yields is a two-tuple with a meta-data
+dict and the compressed bytestring data.
+
+In pseudocode::
+
+ stream = repository.get_repository_stream(search, UNORDERED, False)
+ for record in stream.iter_contents():
+ for metadata, bytes in record.entries:
+ print "Object %s, compression type %s, %d bytes long." % (
+ record.key_prefix + metadata['key'],
+ metadata['storage_kind'], len(bytes))
+
+This structure should allow stream adapters to be written which can coerce
+all records to the type of compression that a particular client needs. For
+instance, inserting into weaves requires fulltexts, so an adapter that
+applies knit records and extracts them to fulltexts will avoid weaves
+needing to know about all potential storage kinds. Likewise, inserting
+into knits would use an adapter that gives everything as either matching
+knit records or full texts.
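The adapter idea above can be sketched as a generator. This is hedged heavily: real knit delta application is format specific, so the "basis plus delta bytes" reconstruction below is purely illustrative, and ``get_basis_text`` is an assumed helper standing in for real delta-basis lookup.

```python
# Illustrative adapter: coerce every record entry to a fulltext
# before a fulltext-only consumer (such as a weave) sees it.

def adapt_to_fulltext(entries, get_basis_text):
    """Yield (metadata, bytes) pairs with every bytestring a fulltext.

    get_basis_text: callable mapping a parent key to its fulltext;
        stands in for real delta-basis lookup.
    """
    for metadata, data in entries:
        kind = metadata['storage_kind']
        if kind == 'fulltext':
            yield metadata, data
        elif kind in ('knit-delta', 'knit-annotated-delta'):
            basis = get_basis_text(metadata['parents'][0])
            new_meta = dict(metadata, storage_kind='fulltext')
            # Not a real knit delta apply; illustrative only.
            yield new_meta, basis + data
        else:
            raise ValueError('unhandled storage kind: %r' % (kind,))
```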
+
+bytestring metadata
+~~~~~~~~~~~~~~~~~~~
+
+Valid keys in the metadata dict are:
+
+ * sha1: Optional ascii representation of the sha1 of the bytestring (after
+ delta reconstruction).
+ * storage_kind: Required kind of storage compression that has been used
+ on the bytestring. One of ``mpdiff``, ``knit-annotated-ft``,
+ ``knit-annotated-delta``, ``knit-ft``, ``knit-delta``, ``fulltext``.
+ * parents: Required graph parents to associate with this bytestring.
+ * compressor_data: Required opaque data relevant to the storage_kind.
+ (This is set to None when the compressor needs no special state.)
+ * key: The key for this bytestring. Like each parent this is a tuple that
+ should have the key_prefix prepended to it to give the unified
+ repository key name.
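The key list above can be turned into a small validator sketch. The required/optional split and the set of storage kinds follow this document only; this is not a real bzrlib checker.

```python
# Sketch validator for the metadata dict described above.

STORAGE_KINDS = ('mpdiff', 'knit-annotated-ft', 'knit-annotated-delta',
                 'knit-ft', 'knit-delta', 'fulltext')
REQUIRED_KEYS = ('storage_kind', 'parents', 'compressor_data', 'key')

def validate_metadata(metadata):
    """Raise ValueError if a required key is missing or invalid."""
    for key in REQUIRED_KEYS:
        if key not in metadata:
            raise ValueError('missing required metadata key: %s' % key)
    if metadata['storage_kind'] not in STORAGE_KINDS:
        raise ValueError('unknown storage_kind: %r'
                         % (metadata['storage_kind'],))
```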
+..
+ vim: ft=rst tw=74 ai
+