Rev 3351: Draft up an interface for repository streams that is more capable than the current one. in http://people.ubuntu.com/~robertc/baz2.0/versioned_files
Robert Collins
robertc at robertcollins.net
Fri Apr 11 02:07:00 BST 2008
At http://people.ubuntu.com/~robertc/baz2.0/versioned_files
------------------------------------------------------------
revno: 3351
revision-id: robertc at robertcollins.net-20080411010629-j07mncp10h10obg8
parent: pqm at pqm.ubuntu.com-20080409233555-n26cmi0y1hb98tf6
committer: Robert Collins <robertc at robertcollins.net>
branch nick: data_stream_revamp
timestamp: Fri 2008-04-11 11:06:29 +1000
message:
Draft up an interface for repository streams that is more capable than the
current one.
added:
doc/developers/repository-stream.txt repositorystream.txt-20080410222511-nh6b9bvscvcerh48-1
modified:
doc/developers/index.txt index.txt-20070508041241-qznziunkg0nffhiw-1
=== modified file 'doc/developers/index.txt'
--- a/doc/developers/index.txt 2008-04-08 07:46:55 +0000
+++ b/doc/developers/index.txt 2008-04-11 01:06:29 +0000
@@ -27,6 +27,9 @@
* `Container format <container-format.html>`_ |--| Notes on a container format
for streaming and storing Bazaar data.
+* `Repository stream <repository-stream.html>`_ |--| Notes on streaming data
+ for repositories (a layer above the container format).
+
* `Indices <indices.html>`_ |--| The index facilities available within bzrlib.
* `Inventories <inventory.html>`_ |--| Tree shape abstraction.
=== added file 'doc/developers/repository-stream.txt'
--- a/doc/developers/repository-stream.txt 1970-01-01 00:00:00 +0000
+++ b/doc/developers/repository-stream.txt 2008-04-11 01:06:29 +0000
@@ -0,0 +1,194 @@
+==================
+Repository Streams
+==================
+
+Status
+======
+
+:Date: 2008-04-11
+
+This document describes the proposed programming interface for streaming
+data from and into repositories. It should provide a single interface
+for pulling data from, and inserting data into, a Bazaar repository.
+
+.. contents::
+
+
+Motivation
+==========
+
+To eliminate the current situation where extracting data from a
+repository requires either using a slow format, or knowing the format of
+both the source repository and the target repository.
+
+
+Use Cases
+=========
+
+Here's a brief description of use cases this interface is intended to
+support.
+
+Fetch operations
+----------------
+
+We fetch data between repositories as part of push/pull/branch operations.
+Fetching data is currently a very interactive process with many
+requests. Supplying the data as a stream will improve push and pull
+performance to remote servers. For purely local operations the
+streaming logic should help reduce memory pressure. In fetch operations
+we always know the formats of both the source and target.
+
+Smart server operations
+~~~~~~~~~~~~~~~~~~~~~~~
+
+With the smart server we support one streaming format, but this is only
+usable when both the client and server have the same model of data, and
+requires non-optimal IO ordering for pack to pack operations. Ideally
+we can address both of these limitations.
+
+Bundles
+-------
+
+Bundles also create a stream of data for revisions from a repository.
+Unlike fetch operations we do not know the format of the target at the
+time the stream is created. It would be good to be able to treat bundles
+as frozen branches and repositories, so a serialised stream should be
+suitable for this.
+
+Data conversion
+---------------
+
+At this point we are not trying to integrate data conversion into this
+interface, though it is likely possible.
+
+
+Characteristics
+===============
+
+Some key aspects of the described interface are discussed in this section.
+
+Single round trip
+-----------------
+
+All users of this should be able to create an appropriate stream from a
+single round trip.
+
+Forward-only reads
+------------------
+
+There should be no need to seek in a stream when inserting data from it
+into a repository. This places an ordering constraint on streams which
+some repositories do not need.
+
+
+Serialisation
+=============
+
+At this point serialisation of a repository stream has not been specified.
+However, some considerations about serialisation are worth noting.
+
+Weaves
+------
+
+While there shouldn't be too many users of weave repositories anymore,
+avoiding pathological behaviour when a weave is being read is a good idea.
+Having the weave itself embedded in the stream is very straightforward
+and avoids expensive on-the-fly extraction and re-diffing.
+
+Bundles
+-------
+
+Being able to perform random reads from a repository stream which is a
+bundle would allow stacking a bundle and a real repository together. This
+will need the pack container format to be used in such a way that we can
+avoid reading more data than needed within the pack container's readv
+interface.
+
+
+Specification
+=============
+
+This describes the interface for requesting a stream, and the programming
+interface a stream must provide. Streams that have been serialised should
+expose the same interface.
+
+Requesting a stream
+-------------------
+
+To request a stream, three parameters are needed:
+
+ * A revision search to select the revisions to include.
+ * A data ordering flag. There are two values for this - 'unordered' and
+ 'topological'. 'unordered' streams are useful when inserting into
+ repositories that have the ability to perform atomic insertions.
+ 'topological' streams are useful when converting data, or when
+ inserting into repositories that cannot perform atomic insertions (such
+ as knit or weave based repositories).
+ * A complete_inventory flag. When provided this flag signals the stream
+ generator to include all the data needed to construct the inventory of
+ each revision included in the stream, rather than just deltas. This is
+ useful when converting data from a repository with a different
+ inventory serialisation, as pure deltas would not be able to be
+ reconstructed.
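The three request parameters above can be sketched as follows. This is a minimal illustration, not bzrlib code: the function name and the string values of the ordering flags are assumptions made for the example.

```python
# Hypothetical sketch of validating the three stream-request
# parameters described above.  Names are illustrative only.

UNORDERED = 'unordered'
TOPOLOGICAL = 'topological'

def request_stream_params(search, order, complete_inventory):
    """Return the validated (search, order, complete_inventory) triple.

    search: an object selecting the revisions to include.
    order: UNORDERED for targets with atomic insertion, TOPOLOGICAL
        for conversions or non-atomic targets (knits, weaves).
    complete_inventory: True to include full inventory data rather
        than deltas (needed across inventory serialisations).
    """
    if order not in (UNORDERED, TOPOLOGICAL):
        raise ValueError('unknown data ordering: %r' % (order,))
    return (search, order, bool(complete_inventory))
```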
+
+
+Structure of a stream
+---------------------
+
+A stream is an object. It can be consistency checked via the ``check``
+method (which consumes the stream). The ``iter_contents`` method can be
+used to iterate the contents of the stream. The contents of the stream are
+a series of top level records, each of which contains one or more
+bytestrings (potentially as a delta against another item in the
+repository) and some optional metadata.
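The object shape described above can be sketched like this. The class names and the particular consistency rules in ``check`` are assumptions for illustration; only the attribute and method names (``check``, ``iter_contents``, ``key_prefix``, ``entries``) come from this document.

```python
# Sketch (not bzrlib code) of the stream object shape: top level
# records, each with a key_prefix tuple and an entries iterator of
# (metadata, bytes) pairs.

class StreamRecord:
    def __init__(self, key_prefix, entries):
        self.key_prefix = key_prefix    # tuple prefix for item keys
        self.entries = iter(entries)    # yields (metadata, bytes)

class RepositoryStream:
    def __init__(self, records):
        self._records = iter(records)

    def iter_contents(self):
        """Yield the top level records; this consumes the stream."""
        return self._records

    def check(self):
        """Consistency-check the stream (consuming it)."""
        for record in self.iter_contents():
            for metadata, data in record.entries:
                # The checked keys here are illustrative only.
                if 'storage_kind' not in metadata or 'key' not in metadata:
                    raise ValueError('malformed record entry')
```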
+
+
+Consuming a stream
+------------------
+
+To consume a stream, obtain an iterator from the stream's
+``iter_contents`` method. This iterator will yield the top level records.
+Each record has two attributes. One is ``key_prefix`` which is a tuple key
+prefix for the names of each of the bytestrings in the record. The other
+attribute is ``entries``, an iterator of the individual items in the
+record. Each item that the iterator yields is a two-tuple with a meta-data
+dict and the compressed bytestring data.
+
+In pseudocode::
+
+ stream = repository.get_repository_stream(search, UNORDERED, False)
+ for record in stream.iter_contents():
+ for metadata, bytes in record.entries:
+ print "Object %s, compression type %s, %d bytes long." % (
+ record.key_prefix + metadata['key'],
+ metadata['storage_kind'], len(bytes))
+
+This structure should allow stream adapters to be written which can coerce
+all records to the type of compression that a particular client needs. For
+instance, inserting into weaves requires fulltexts, so an adapter that
+applies knit records and extracts them to fulltexts will avoid weaves
+needing to know about all potential storage kinds. Likewise, inserting
+into knits would use an adapter that gives everything as either matching
+knit records or full texts.
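The adapter idea above can be sketched as a generator. This is hedged heavily: real knit delta application is format specific, so the "basis plus delta bytes" reconstruction below is purely illustrative, and ``get_basis_text`` is an assumed helper standing in for real delta-basis lookup.

```python
# Illustrative adapter: coerce every record entry to a fulltext
# before a fulltext-only consumer (such as a weave) sees it.

def adapt_to_fulltext(entries, get_basis_text):
    """Yield (metadata, bytes) pairs with every bytestring a fulltext.

    get_basis_text: callable mapping a parent key to its fulltext;
        stands in for real delta-basis lookup.
    """
    for metadata, data in entries:
        kind = metadata['storage_kind']
        if kind == 'fulltext':
            yield metadata, data
        elif kind in ('knit-delta', 'knit-annotated-delta'):
            basis = get_basis_text(metadata['parents'][0])
            new_meta = dict(metadata, storage_kind='fulltext')
            # Not a real knit delta apply; illustrative only.
            yield new_meta, basis + data
        else:
            raise ValueError('unhandled storage kind: %r' % (kind,))
```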
+
+bytestring metadata
+~~~~~~~~~~~~~~~~~~~
+
+Valid keys in the metadata dict are:
+
+ * sha1: Optional ascii representation of the sha1 of the bytestring (after
+ delta reconstruction).
+ * storage_kind: Required kind of storage compression that has been used
+ on the bytestring. One of ``mpdiff``, ``knit-annotated-ft``,
+ ``knit-annotated-delta``, ``knit-ft``, ``knit-delta``, ``fulltext``.
+ * parents: Required graph parents to associate with this bytestring.
+ * compressor_data: Required opaque data relevant to the storage_kind.
+ (This is set to None when the compressor needs no special state.)
+ * key: The key for this bytestring. Like each parent this is a tuple that
+ should have the key_prefix prepended to it to give the unified
+ repository key name.
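The key list above can be turned into a small validator sketch. The required/optional split and the set of storage kinds follow this document only; this is not a real bzrlib checker.

```python
# Sketch validator for the metadata dict described above.

STORAGE_KINDS = ('mpdiff', 'knit-annotated-ft', 'knit-annotated-delta',
                 'knit-ft', 'knit-delta', 'fulltext')
REQUIRED_KEYS = ('storage_kind', 'parents', 'compressor_data', 'key')

def validate_metadata(metadata):
    """Raise ValueError if a required key is missing or invalid."""
    for key in REQUIRED_KEYS:
        if key not in metadata:
            raise ValueError('missing required metadata key: %s' % key)
    if metadata['storage_kind'] not in STORAGE_KINDS:
        raise ValueError('unknown storage_kind: %r'
                         % (metadata['storage_kind'],))
```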
+..
+ vim: ft=rst tw=74 ai
+