===============
Bundle format 4
===============

:Date: 2007-06-21

Motivation
----------
Format 4 is designed to be a compact format that can be generated quickly and
installed into a repository efficiently.  It is not intended to be
human-readable; that responsibility has been given to merge directives.

Format Name
-----------
This is the fourth format to see public use.  Previous versions were 0.7, 0.8,
and 0.9.  Only 0.7's version number was aligned with a Bazaar release.


Dependencies
------------
- Container format 1
- Multiparent diffs
- Bencode


Description
-----------
This format was designed to trade human-readability for speed and compactness.
It does not contain a human-readable "prelude" patch.

Relation to merge directives
----------------------------
A merge directive specifies a merge command to apply and a preview of what that
command would do.  Merge directives may contain a format-4 bundle.  The
bundle's job is to provide the data needed to perform that merge command.

It is recommended that the bundle be provided in a bzip2-compressed,
base64-encoded format, to ensure compactness and resistance to email-transport
damage.
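
The recommended encoding can be sketched with the Python standard library.
This is a minimal illustration, not the merge-directive code itself; the
function names ``encode_for_email`` and ``decode_from_email`` are hypothetical.

.. code-block:: python

    import base64
    import bz2


    def encode_for_email(bundle_bytes):
        """Compress with bzip2, then base64-encode into short ASCII lines."""
        return base64.encodebytes(bz2.compress(bundle_bytes))


    def decode_from_email(encoded):
        """Reverse encode_for_email: base64-decode, then decompress."""
        return bz2.decompress(base64.decodebytes(encoded))

The encoded form contains only ASCII characters, which is what makes it
resistant to damage by mail transports.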

A preview/overview patch may be provided by the merge directive.


Serialization
-------------
Format 4 stores revision and inventory records in their repository
serialization format.  This minimizes translation and compression costs
in the common case, where the sender and receiver use the same serialization
format for their repository. Steps have been taken to ensure a faithful
conversion when serialization formats are mismatched.

Record naming
-------------
All records have a single name.  Records are named according to their
content-kind, revision-id, and file-id.

Content-kind may be one of:

:file: a version of a user file
:inventory: the tree inventory
:revision: the revision metadata for a revision
:signature: the revision signature for a revision
:testament: a testament for a revision

Names are constructed like so: "content-kind:revision-id/file-id".
A record has a file-id if-and-only-if it is a file record.
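
The naming rule can be sketched as a pair of helper functions.  This is an
illustration of the layout described above, not the bundle implementation:
the names ``encode_name`` and ``decode_name`` are hypothetical, and the sketch
assumes that revision ids contain no "/" characters, since no escaping rule is
specified here.

.. code-block:: python

    CONTENT_KINDS = ("file", "inventory", "revision", "signature", "testament")


    def encode_name(content_kind, revision_id, file_id=None):
        """Build "content-kind:revision-id/file-id" (file-id for files only)."""
        if content_kind not in CONTENT_KINDS:
            raise ValueError("unknown content kind: %r" % (content_kind,))
        # A record has a file-id if and only if it is a file record.
        if (content_kind == "file") != (file_id is not None):
            raise ValueError("file-id is required for file records only")
        if content_kind == "file":
            return "%s:%s/%s" % (content_kind, revision_id, file_id)
        return "%s:%s" % (content_kind, revision_id)


    def decode_name(name):
        """Split a record name into (content_kind, revision_id, file_id)."""
        content_kind, _, rest = name.partition(":")
        if content_kind not in CONTENT_KINDS:
            raise ValueError("unknown content kind in name: %r" % (name,))
        if content_kind == "file":
            # Assumption: the first "/" separates revision-id from file-id.
            revision_id, separator, file_id = rest.partition("/")
            if not separator or not file_id:
                raise ValueError("file record without a file-id: %r" % (name,))
            return content_kind, revision_id, file_id
        return content_kind, rest, None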

Record metainfo
---------------
The bundle format subdivides a pack record body into a bundle header and body.
The header contains a Bencoded dict of values.  It is separated from the body
by a newline.

:record_kind: The storage strategy of the record.  May be "fulltext" (the
    record body contains the full text of the value), "mpdiff" (the record body
    contains a multi-parent diff of the value), or "header" (the record body is
    empty).
:parents: Used in fulltext and mpdiff records.  The revisions that should be
    noted as parents of this revision in the repository.  For mpdiffs, this is
    also the list of build-parents.
:sha1: Used in mpdiff records.  The sha-1 hash of the full-text value.
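
The header/body split can be sketched as follows.  The small Bencode codec is
included only to keep the example self-contained; it is not the Bazaar
``bencode`` module, and the helper names ``pack_record_body`` and
``unpack_record_body`` are hypothetical.

.. code-block:: python

    def bencode(value):
        """Encode ints, bytes, lists and dicts using Bencode."""
        if isinstance(value, int):
            return b"i%de" % value
        if isinstance(value, bytes):
            return b"%d:%s" % (len(value), value)
        if isinstance(value, (list, tuple)):
            return b"l" + b"".join(bencode(item) for item in value) + b"e"
        if isinstance(value, dict):
            pairs = sorted(value.items())  # Bencode requires sorted keys
            return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
        raise TypeError("cannot bencode %r" % (value,))


    def bdecode_prefix(data, pos=0):
        """Decode one value starting at pos; return (value, next_pos)."""
        lead = data[pos:pos + 1]
        if lead == b"i":
            end = data.index(b"e", pos)
            return int(data[pos + 1:end]), end + 1
        if lead in (b"l", b"d"):
            items = []
            pos += 1
            while data[pos:pos + 1] != b"e":
                item, pos = bdecode_prefix(data, pos)
                items.append(item)
            if lead == b"l":
                return items, pos + 1
            # A dict is a flat sequence of alternating keys and values.
            return dict(zip(items[0::2], items[1::2])), pos + 1
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        text = data[start:start + length]
        if len(text) != length:
            raise ValueError("truncated bencoded string")
        return text, start + length


    def pack_record_body(metadata, body=b""):
        """Join the Bencoded header dict and the body with a newline."""
        return bencode(metadata) + b"\n" + body


    def unpack_record_body(record):
        """Split a pack record body into (metadata dict, bundle body)."""
        metadata, pos = bdecode_prefix(record)
        if not isinstance(metadata, dict) or record[pos:pos + 1] != b"\n":
            raise ValueError("malformed bundle record header")
        return metadata, record[pos + 1:]

Because a Bencoded dict is self-delimiting, the decoder finds the end of the
header without scanning for the newline, so newlines inside the body cause no
ambiguity.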

Layout
------
The first record is an info/header record.

The subsequent records are mpdiff file records.  They are ordered first by file
id, then in topological order by revision-id.

The next records are mpdiff inventory records.  They are topologically sorted.

The next records are revision and signature fulltexts.  They are interleaved
and topologically sorted.
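
The ordering can be sketched with the standard library's ``graphlib`` module
(Python 3.9 or later).  The function names, the ``("info", None, None)``
placeholder for the header record, and the choice to place each signature
directly after its revision are assumptions made for illustration.

.. code-block:: python

    from graphlib import TopologicalSorter


    def topological_order(parent_map):
        """Return the revision ids in parent_map, parents before children."""
        # static_order() yields every node after all of its predecessors.
        # Parents that are not keys of parent_map are not part of the bundle
        # and are dropped.
        order = TopologicalSorter(parent_map).static_order()
        return [rev for rev in order if rev in parent_map]


    def layout_order(parent_map, file_records, signed_revisions=()):
        """List (content_kind, revision_id, file_id) in the layout order.

        file_records is an iterable of (file_id, revision_id) pairs.
        """
        revisions = topological_order(parent_map)
        position = {rev: index for index, rev in enumerate(revisions)}
        layout = [("info", None, None)]
        # File records: first by file id, then topologically by revision.
        by_file = sorted(file_records, key=lambda r: (r[0], position[r[1]]))
        layout.extend(("file", rev, file_id) for file_id, rev in by_file)
        # Inventory records, topologically sorted.
        layout.extend(("inventory", rev, None) for rev in revisions)
        # Revision and signature fulltexts, interleaved.
        for rev in revisions:
            layout.append(("revision", rev, None))
            if rev in signed_revisions:
                layout.append(("signature", rev, None))
        return layout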

Implementation notes
--------------------
- knit deltas contain almost enough information to extract the original
  SequenceMatcher.get_matching_blocks() call used to produce them.  Combining
  that information with the relevant fulltexts allows us to avoid performing
  sequence matching on any fulltexts for which we have deltas.

- MultiParent deltas contain get_matching_blocks output almost verbatim, but
  if there is more than one parent, the information about the leftmost parent
  may be incomplete.  However, for single-parent multiparent diffs, we can
  extract the SequenceMatcher.get_matching_blocks output, and therefore
  the SequenceMatcher.get_opcodes output used to create knit deltas.
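
The relationship between the two outputs can be shown with the standard
library's ``difflib.SequenceMatcher``, used here as a stand-in for whichever
matcher produced the deltas.  The helper name ``opcodes_from_matching_blocks``
is hypothetical; it rebuilds the opcodes from the matching blocks alone,
without comparing the texts again.

.. code-block:: python

    from difflib import SequenceMatcher


    def opcodes_from_matching_blocks(blocks):
        """Rebuild get_opcodes() output from get_matching_blocks() output."""
        opcodes = []
        i = j = 0  # end of the previous matching block in a and in b
        for a_start, b_start, size in blocks:
            # Anything between the previous block and this one is a change.
            if i < a_start and j < b_start:
                opcodes.append(("replace", i, a_start, j, b_start))
            elif i < a_start:
                opcodes.append(("delete", i, a_start, j, b_start))
            elif j < b_start:
                opcodes.append(("insert", i, a_start, j, b_start))
            i, j = a_start + size, b_start + size
            # The final block is a zero-length sentinel and adds no opcode.
            if size:
                opcodes.append(("equal", a_start, i, b_start, j))
        return opcodes


    old = ["a\n", "b\n", "c\n"]
    new = ["a\n", "x\n", "c\n", "d\n"]
    matcher = SequenceMatcher(None, old, new)
    rebuilt = opcodes_from_matching_blocks(matcher.get_matching_blocks())
    assert rebuilt == matcher.get_opcodes()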

Installing data across serialization mismatches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In practice, there cannot be revision serialization mismatches, because the
serialization of revisions has been consistent in serializations 5-7.

If there is a mismatch in inventory serialization formats, the receiver can:

  1. extract the inventory objects for the parents
  2. serialize them using the bundle serializer
  3. apply the mpdiff
  4. calculate the fulltext sha1
  5. compare the calculated sha1 to the expected sha1
  6. deserialize using the bundle serializer
  7. serialize using the repository serializer
  8. add to the repository

This is much slower, of course.  But since the fulltext is verified
at step 5, it should be just as safe as any other conversion.
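
The eight steps above can be sketched as a single function.  Everything here
is hypothetical scaffolding: the serializer objects are assumed to offer
``write`` and ``read`` methods, ``apply_mpdiff`` and ``add_to_repo`` are
caller-supplied callables, and ``LineSerializer`` is a toy stand-in rather
than a real inventory serializer.

.. code-block:: python

    import hashlib


    def install_inventory(parent_inventories, mpdiff, expected_sha1,
                          bundle_serializer, repo_serializer,
                          apply_mpdiff, add_to_repo):
        """Install one inventory whose bundle and repository formats differ."""
        # Steps 1-2: re-serialize the already extracted parent inventories
        # using the bundle's serializer.
        parent_texts = [bundle_serializer.write(inv) for inv in parent_inventories]
        # Step 3: apply the multi-parent diff to obtain the fulltext.
        fulltext = apply_mpdiff(mpdiff, parent_texts)
        # Steps 4-5: verify the fulltext against the recorded sha1.
        actual_sha1 = hashlib.sha1(fulltext).hexdigest()
        if actual_sha1 != expected_sha1:
            raise ValueError("sha1 mismatch: %s != %s" % (actual_sha1, expected_sha1))
        # Steps 6-7: convert from the bundle format to the repository format.
        inventory = bundle_serializer.read(fulltext)
        repo_text = repo_serializer.write(inventory)
        # Step 8: store the converted text.
        add_to_repo(repo_text)
        return repo_text


    class LineSerializer:
        """Toy serializer: an "inventory" is a list of strings."""

        def __init__(self, separator):
            self.separator = separator

        def write(self, inventory):
            return self.separator.join(inventory).encode("utf-8")

        def read(self, text):
            return text.decode("utf-8").split(self.separator)

Nothing is added to the repository unless the sha1 check passes, so a faulty
conversion of the parents is detected before any data is stored.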