Introduction to history deltas

Mon Dec 5 18:28:43 GMT 2005

I've been looking at a new history format for bzr, which is modeled
after the fsfs-backend for Subversion.  I post a few notes here to
start a discussion and get some feedback.

Introduction
============

History deltas is a concept of storing all logical changes introduces
by a revision in a single immutable chunk.  The main motivation for
this is to speed up retrieval of revisions over a dumb transport such
as http, and also to minimize the risk of history corruption by bugs
in bzr.

A history delta is made up of revision information plus zero or more
text deltas.  A text delta is the changes introduced by the revision
to a specific file (identified by its file-id) represented as a line
delta.

Each logical file has a corresponding revision node.  The revision
node is a unsorted list of text deltas that is associated with the
file.  It also contains ancestry information for fast graph
generation.

The concept and format is based on the fsfs-backend for Subversion.

Repository layout
=================

The history delta repository is made up out of three different, but
not independent, stores:

 * The transaction store
    Stores yet-to-be-committed history deltas.  

 * The delta store   
    Stores immutable and committed history deltas.

 * The revision node store
    Stores per-file ancestry and delta information.

The transaction store depends on the delta store to retrieve texts
when line-deltas for new history deltas are generated, but also on
the node revision store to retrieve per-file ancestry graphs.

The delta store depend on the node revision store to retrieve text
delta locations.

The directory layout of a history delta repository:

  revisions/
    <<revision-id>>.delta        # committed history deltas
    ...
  transactions/
    <<revision-id>>/
      <<file-id>>                # not-yet-committed text delta for
                                 # file <<file-id>>
  nodes/
    <<file-id>>                  # revision node for <<file-id>>

Transactions
============

When a history delta is about to be constructed (by "bzr commit") a
commit transaction is created.  This creates a directory in the
transaction store named after the revision-id of the soon-to-be
history delta.  When text deltas are generated they are stored as
files in the transaction directory.  When all text deltas has been
generated they are marshalled together with revision information into
a history delta and moved to the delta store.

The main reason to keep text deltas on disk instead of in memory is
that some history deltas may contain a large number of text deltas,
where each text delta can be rather large.  Importing the linux kernel
source would result in a history delta that is several several hundred
megabytes.

History deltas
==============

As stated earlier history deltas contain revision information along
with zero or more text deltas.  The revision information is serialized
using Martin's basicio/rio facility.  

Each text delta has the following attributes:

 * file-id
 * a basis id
 * a payload length
 * zero or more per-file parents.

The basis id is the revision of the file to which the line-delta
should applied to get the full contents of the file.  The reason to
have a separate basis id, and simply using the first parent, is that
it enables techniques such as skip-deltas.

The payload length specifies the text-delta payload size in bytes,
following the header-line.

Below you can see an example of history delta "r123" which contains a
single text-delta for file with file-id "foo-file":

  revision_id: r123
  inventory_sha1: 12ABDS2312AS1...
  parent: r122
  message: change color to blue.
  timestamp: 12132131321
  timezone: -23000

  text-delta foo-file r64 r87
  0,1,1
  blue
  end foo-file

This history delta changes the second line in "foo-file" to "blue".
The line delta is generated against revision 64 of the file.  

Revision nodes
==============

Information about Text deltas associated with a file-id is cached in
something called "revision nodes".  This makes it easy to find out
which revisions touch a specific file, but also makes it easier to
create ancestry graphs for a file.  The format of the file is simple,
and contains a list of lines with the following format:

  revision_id basis_id data_pos data_size [parents]

For fast extraction of text delta data from history delta file the
location and size of the text delta is stored as well.

Revision nodes should be treated as caches.  If information about a
revision can not be found in the revision node, the code should fall
back at inspecting the history delta instead of failing.
 (* XXX this might change *)

Experimental results
====================

I created 132 history deltas, where each history delta contains a
single text delta with its first parent as basis id.  Extracting
the last text version took ~0.3 seconds on my P3 at 800MHz. 

The size of the history deltas are ~796 kb, where the smallest delta
is 156 bytes and the largest is 32 kb.  Note that the text deltas are
not compressed.  The expected history size with compressed deltas is
<290 kb.  Which means it is comparable to knits.

Thanks for reading,

~j