Rev 2519: (Andrew Bennetts) Developer documentation for the propsed container format. in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Fri Jun 8 04:25:47 BST 2007


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 2519
revision-id: pqm at pqm.ubuntu.com-20070608032544-x774g5cn0yh8a42p
parent: pqm at pqm.ubuntu.com-20070607233523-rrkkcg7n2ks0zi18
parent: andrew.bennetts at canonical.com-20070608024719-g3ewkke9l20ts0g9
committer: Canonical.com Patch Queue Manager<pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Fri 2007-06-08 04:25:44 +0100
message:
  (Andrew Bennetts) Developer documentation for the propsed container format.
added:
  doc/developers/container-format.txt containerformat.txt-20070601074309-7n7w1jiyayud6xdn-1
modified:
  Makefile                       Makefile-20050805140406-d96e3498bb61c5bb
  doc/developers/index.txt       index.txt-20070508041241-qznziunkg0nffhiw-1
    ------------------------------------------------------------
    revno: 2499.4.4
    merged: andrew.bennetts at canonical.com-20070608024719-g3ewkke9l20ts0g9
    parent: andrew.bennetts at canonical.com-20070607064153-z6sap26og3pbxxrl
    committer: Andrew Bennetts <andrew.bennetts at canonical.com>
    branch nick: container-format-doc
    timestamp: Fri 2007-06-08 12:47:19 +1000
    message:
      Typos and clarfications thanks to Aaron.
    ------------------------------------------------------------
    revno: 2499.4.3
    merged: andrew.bennetts at canonical.com-20070607064153-z6sap26og3pbxxrl
    parent: andrew.bennetts at canonical.com-20070607040810-7aeblrfw0sl41mun
    committer: Andrew Bennetts <andrew.bennetts at canonical.com>
    branch nick: container-format-doc
    timestamp: Thu 2007-06-07 16:41:53 +1000
    message:
      Updates in response to feedback on mailing list and in person.
    ------------------------------------------------------------
    revno: 2499.4.2
    merged: andrew.bennetts at canonical.com-20070607040810-7aeblrfw0sl41mun
    parent: andrew.bennetts at canonical.com-20070601074703-5z2fps5c8c7c6bjp
    committer: Andrew Bennetts <andrew.bennetts at canonical.com>
    branch nick: container-format-doc
    timestamp: Thu 2007-06-07 14:08:10 +1000
    message:
      Use standard heading markers.
    ------------------------------------------------------------
    revno: 2499.4.1
    merged: andrew.bennetts at canonical.com-20070601074703-5z2fps5c8c7c6bjp
    parent: pqm at pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5
    committer: Andrew Bennetts <andrew.bennetts at canonical.com>
    branch nick: container-format-doc
    timestamp: Fri 2007-06-01 17:47:03 +1000
    message:
      First draft of container format developer doc, based on discussion with Robert.
=== added file 'doc/developers/container-format.txt'
--- a/doc/developers/container-format.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/container-format.txt	2007-06-08 02:47:19 +0000
@@ -0,0 +1,248 @@
+================
+Container format
+================
+
+Status
+======
+
+:Date: 2007-06-07
+
+This document describes the proposed container format for streaming and
+storing collections of data in Bazaar.  Initially this will be used for
+streaming revision data for incremental push/pull in the smart server for
+0.18, but the intention is that this will be the basis for much more
+than just that use case.
+
+In particular, this document currently focuses almost exclusively on the
+streaming case, and not the on-disk storage case.  It also does not
+discuss the APIs used to manipulate containers and their records.
+
+
+.. contents::
+
+
+Motivation
+==========
+
+To create a low-level file format which is suitable for solving the smart
+server latency problem and whose layout and requirements are extendable in
+future versions of Bazaar, and with no requirements that the smart server
+does not have today.
+
+
+Terminology
+===========
+
+A **container** is a streamable file that contains a series of
+**records**.  Records may have **names**, and consist of bytes.
+
+
+Use Cases
+=========
+
+Here's a brief description of use cases this format is intended to
+support.
+
+Streaming data between a smart server and client
+------------------------------------------------
+
+It would be nice if we could combine multiple containers into a single
+stream by something no more expensive than concatenation (e.g. by omitting
+end/start marker pairs).
+
+This doesn't imply that such a combination necessarily produces a valid
+container (e.g. care must be taken to ensure that names are still unique
+in the combined container), or even a useful container.  It is simply that
+the cost of assembling a new combined container is practically as cheap as
+simple concatenation.
+
+Incremental push or pull
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the use case of incremental push/pull, which is currently (0.16)
+very slow on high-latency links due to the large number of round trips.
+What we'd like is something like the following.
+
+A client will make a request meaning "give me the knit contents for these
+revision IDs" (how the client determines which revision IDs it needs is
+unimportant here).  In response, the server streams a single container of:
+
+  * one record per file-id:revision-id knit gzip contents and graph data,
+  * one record per inventory:revision-id knit gzip contents and graph
+    data,
+  * one record per revision knit gzip contents,
+  * one record per revision signature,
+  * end marker record.
+
+in that order.
+
+Persistent storage on disk
+--------------------------
+
+We want a storage format that allows lock-free writes, which suggests a
+format that uses *rename into place*, and *do not modify after writing*.
+
+Usable before deep model changes to Bazaar
+------------------------------------------
+
+We want a format we can use and refine sooner rather than later.  So it
+should be usable before the anticipated model changes for Bazaar "1.0"
+land, while not conflicting with those changes either.
+
+Specifically, we'd like to have this format in Bazaar 0.18.
+
+Examples of possible record content
+-----------------------------------
+
+  * full texts of file versions
+  * deltas of full texts
+  * revisions
+  * inventories
+  * inventory as tree items e.g. the inventory data for 20 files
+  * revision signatures
+  * per-file graph data
+  * annotation cache
+
+
+Characteristics
+===============
+
+Some key aspects of the described format are discussed in this section.
+
+No length-prefixing of entire container
+---------------------------------------
+
+The overall container is not length-prefixed.  Instead there is an end
+marker so that readers can determine when they have read the entire
+container.  This also does not conflict with the goal of allowing
+single-pass writing.
+
+Structured as a self-contained series of records
+------------------------------------------------
+
+The container contains a series of *records*.  Each record is
+self-delimiting.  Record markers are lightweight.  The overhead in terms
+of bytes and processing for records in this container vs. the raw contents
+of those records is minimal.
+
+Addressing records
+------------------
+
+There is a requirement that each object can be given an arbitrary name.
+Some revision control systems address all content by the SHA-1 digest of
+that content, but this scheme is unsatisfactory for Bazaar's revision
+objects.  We can still allow addressing by SHA-1 digest for those content
+types where it makes sense.
+
+Some proposed object names:
+
+  * to name a revision: "``revision:``\ *revision-id*".  e.g., 
+    `revision:pqm at pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5`.
+  * to name an inventory delta: "``inventory.delta:``\ *revision-id*".  e.g.,
+    `inventory.delta:pqm at pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5`.
+
+It seems likely that we may want to have multiple names for an object.
+This format allows that (by allowing multiple ``name`` headers in a Bytes
+record).
+
+Although records are in principle addressable by name, this specification
+alone doesn't provide for efficient access to a particular record given
+its name.  It is intended that seperate indexes will be maintained to
+provide this.
+
+It is acceptable to have records with no explicit name, if the expected
+use of them does not require them.  For example:
+
+  * a record's content could be self-describing in the context of a
+    particular container, or
+  * a record could be accessed via an index based on SHA-1, or 
+  * when streaming, the first record could be treated specially.
+
+Reasonably cheap for small records
+----------------------------------
+
+The overhead for storing fairly short records (tens of bytes, rather than
+thousands or millions) is minimal.  The minimum overhead is 3 bytes plus
+the length of the decimal representation of the *length* value (for a
+record with no name).
+
+
+Specification
+=============
+
+This describes just a basic layer for storing a simple series of
+"records".  This layer has no intrinsic understanding of the contents of
+those records.
+
+The format is:
+
+  * a **container lead-in**, "``bzr pack format 1\n``",
+  * followed by one or more **records**.
+
+A record is:
+
+  * a 1 byte **kind marker**.
+  * 0 or more bytes of record content, depending on the record type.
+
+Record types
+------------
+
+End Marker
+~~~~~~~~~~
+
+An **End Marker** record:
+
+  * has a kind marker of "``E``",
+  * no content bytes.
+
+End Marker records signal the end of a container.
+
+Bytes
+~~~~~
+
+A **Bytes** record:
+
+  * has a kind marker of "``B``",
+  * followed by a mandatory **content length** [1]_: 
+    "*number*\ ``\n``", where *number* is in decimal, e.g::
+
+      1234
+
+  * followed by zero or more optional **names**: 
+    "*name*\ ``\n``", e.g.::
+
+      revision:pqm at pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5
+
+  * followed by an **end of headers** byte: "``\n``",
+  * followed by some **bytes**, exactly as many as specified by the length
+    prefix header.
+
+So a Bytes record is a series of lines encoding the length and names (if
+any) followed by a body.
+
+For example, this is a possible Bytes record (including the kind marker)::
+
+  B26
+  example-name1
+  example-name2
+
+  abcdefghijklmnopqrstuvwxyz
+
+
+Names
+-----
+
+Names should be UTF-8 encoded strings, with no whitespace.  Names should
+be unique within a single container, but no guarantee of uniqueness
+outside of the container is made by this layer.  Names need to be at least
+one character long.
+
+
+.. [1] This requires that the writer of a record knows the full length of
+  the record up front, which typically means it will need to buffer an
+  entire record in memory.  For the first version of this format this is
+  considered to be acceptable.
+
+..
+   vim: ft=rst tw=74 ai
+

=== modified file 'Makefile'
--- a/Makefile	2007-06-06 07:57:00 +0000
+++ b/Makefile	2007-06-08 03:25:44 +0000
@@ -86,13 +86,13 @@
 pretty_files: $(patsubst doc/%.txt, $(PRETTYDIR)/%.htm, $(txt_files))
 
 doc/developers/%.htm: doc/developers/%.txt
-	python tools/rst2html.py --link-stylesheet --stylesheet=../default.css $< $@
+	python tools/rst2html.py --link-stylesheet --stylesheet=../default.css --footnote-references=superscript $< $@
 
 doc/developers/HACKING.htm: doc/developers/HACKING
-	python tools/rst2html.py --link-stylesheet --stylesheet=../default.css $< $@
+	python tools/rst2html.py --link-stylesheet --stylesheet=../default.css --footnote-references=superscript $< $@
 
 %.htm: %.txt
-	python tools/rst2html.py --link-stylesheet --stylesheet=default.css $< $@
+	python tools/rst2html.py --link-stylesheet --stylesheet=default.css --footnote-references=superscript $< $@
 
 $(PRETTYDIR)/%.htm: pretty_docs doc/%.txt
 	python tools/rst2prettyhtml.py doc/bazaar-vcs.org.kid doc/$*.txt \

=== modified file 'doc/developers/index.txt'
--- a/doc/developers/index.txt	2007-05-08 04:14:01 +0000
+++ b/doc/developers/index.txt	2007-06-01 07:47:03 +0000
@@ -13,3 +13,8 @@
 * `Performance roadmap <performance-roadmap.htm>`_
 
   The roadmap for fixing performance in bzr over the next few releases.
+
+* `Container format <container-format.htm>`_
+
+  Notes on a container format for streaming and storing Bazaar data.
+




More information about the bazaar-commits mailing list