[RFC] New generic graph storage format

Fri Sep 7 05:16:33 BST 2007

I've been working on replacing bzrlib.index.GraphIndexBuilder and
bzrlib.index.GraphIndex with something offering operations which
are all better than linear for the cases we care about.

First, an introduction. The GraphIndex, as it stands, is used for
several parts of bzrlib. However, it does not expose an interface
conforming exactly to the idea of an abstract graph, because the
GraphIndex 'knows' that a key may be a tuple of subkeys, and it
also allows multiple reference-lists to be stored.

I consider this to be a layering violation, not because of picky
software engineering reasons, but because having multiple
adjacency lists inside the single storage space prevents having
the data spatially coherent. Or rather, only one of the adjacency
lists can be optimized in this way. If you want to store multiple
adjacency lists anyway, put the alternate references in the value
for that node or build a separate GraphIndex.

So, what I've started building is a generic write-once graph data
structure designed for fast access locally and across the network
where only readv() is provided. Below is the (DRAFT!) docstring
for the CompactGraphIndex, which describes the format. After the
docstring I include some discussion around why I chose to
structure the CompactGraphIndex this way.

You may want to skip the Format section and read my comments below
first.

class CompactGraphIndexBuilder(object):
    """Construct a space-and-time efficient DAG serialization.

    Given a DAG, this class builds a serialization suitable for rapid querying
    and traversal over slow network connections, where access is via readv().

    The graph consists of a series of nodes. Each node contains keys (arbitrary
    length byte sequences), values (arbitrary length byte sequences), and
    references to other nodes.  In other words, the graph contains a mapping
    k->v, and a mapping k->k*.

    The on-disk format has a number of properties which are desireable for the
    usecases relevant to bzr; specifically, it is compact (keys are compressed
    in chunks), keys are indexed (there is an index-for-the-index so finding
    the hash for a key doesn't require reading the whole index), and it is
    spatially coherent (when traversing a subset of the graph, all successors
    are likely nearby).

    The on-disk structure breaks down into three components.

      (1) The Preamble, which contains metadata about integer sizes and file
          format version;
      (2) The TopologicalKeyIndex, which is a mapping of key -> node offset
          (aka hash) where the offsets correspond to the position of that key's
          references and data in the Nodes section
      (3) The Nodes, which contain references to other nodes, some
          value for this node, and a backpointer to the full key name in the
          TopologicalKeyIndex. The nodes are ordered by a topological sort
          (specifically reverse post-order), and identified according to their
          offset from the start (their node hash). In other words, the node
          hash is monotonically related to the position of that node in the
          topological sort.

    The Format
    ==========

    GraphIndex := Preamble TopologicalKeyIndex Nodes

    Preamble          := Magic NodesOffset NodeRefSize NodeRefCountSize
                         MaxKeyLengthSize KeyLengthSize IndexBlockRefSize
                         ValueLengthSize PreamblePadding
    Magic             := 'Bazaar Compact Graph 1' NEWLINE
    NodesOffset       := UINT64 file offset of Nodes (&Nodes)
    NodeRefSize       := UINT8 Size in bytes of an node reference.
    NodeRefCountSize  := UINT8 Size of the number of references in a node
    MaxKeyLengthSize  := UINT8 Size of the max length of a key in an IndexBlock
    KeyLengthSize     := UINT8 Size of the length of a key in an IndexBlock
    IndexBlockRefSize := UINT8 Size of an offset to an IndexBlock
    ValueLengthSize   := UINT8 Size of a value length
    IndexBlockRef     := UINT(IndexBlockRefSize)
    PreamblePadding   := bytes[64 - len(Preamble)]

    TopologicalKeyIndex   := CompressedHeaderIndexSize CompressedHeaderIndex
                             CompressedIndexBlock*
    CompressedHeaderIndexSize := UINT32
    CompressedHeaderIndex := LZW(HeaderIndex)
    HeaderIndex           := MaxKeyLength KeyIndexBlockRefPair*
    KeyIndexBlockRefPair  := Key KeyIndexPairPadding IndexBlockRef
    KeyIndexPairPadding   := bytes[MaxKeyLength - KeyLength]

    CompressedIndexBlock  := LZW(IndexBlock)
    IndexBlock            := MaxKeyLength KeyNodeRefPair*
    MaxKeyLength          := UINT(MaxKeyLengthSize)
    KeyNodeRefPair        := Key KeyPadding NodeRef
    Key                   := KeyLength KeyData
    KeyLength             := UINT(KeyLengthSize)
    KeyData               := bytes[KeyLength]
    KeyPadding            := NULL[MaxKeyLength - KeyLength]

    Nodes          := Node*
    Node           := NodeReferences KeyReference Value
    NodeReferences := NumRefs NodeReference{NumRefs}
    NumRefs        := UINT(NodeRefCountSize)
    NodeReference  := UINT(NodeRefSize)
        This is a byte offset from &Nodes
    KeyReference   := IndexBlockRef
    Value          := ValueLength ValueData
    ValueLength    := UINT(ValueLengthSize)
    ValueData      := bytes[ValueLength]

    Notes: 

     * Integers are stored in LITTLE ENDIAN for the practical reason that
       most computers are Intel. 
     * All sizes are in bytes

    """
    pass

What does all this mean? Well, the relevant parts as to why this
is an improvement over the current GraphIndex, is that the keys
are stored compressed, and are indexed (an index-for-the-index).

The TopologicalKeyIndex is a series of compressed blocks,
corresponding to blocks from the list of (key, hash) pairs sorted
lexiographically by key. Then there is a header in the
TopologicalKeyIndex, which is also compressed, which lists the
first key in each block, along with the byte offset of that block
in the file. Since the header is small we download it in one go.
Then, when finding the hash for a key in the TopologicalKeyIndex,
we have to read only one other block.

Within a single block, the sorted keys/hash pairs is stored as
fixed-size rows to make searching for a key which we know is
present fast, and not require an additional indexing step. This is
sensible when there are many keys with very similar lengths, which
is the case for bzr. For the Python implementation, I will likely
put the keys/hashes into a dictionary, but for an efficient C
version we can directly binary search the bytes because they are
sorted.

Questions:
 * Use LZW for the blocks? With 5k of sorted keys, it compresses
   to about 1k using zlib.compress().
 * What about radix trees for the blocks instead? That way there
   would be no decompression step because the tree would already
   have high entropy. However, backpointers from the node storage
   into the key index would be trickier to do.
 * What is the best sorting? Can we optimize the case where we
   want checkouts to be fast, (which is a linear traversal) but we
   don't care about revisions after unless we are doing log; in
   the checkout case we'd want to store some other recent
   information there (Thanks to Aaron for pointing this out)
 * Perhaps use length + compressed XML hunk for the preamble
   instead of the struct-like representation above.

That's all for now; I'm going to proceed and implement the above.
Comments are appreciated, nothing is set in stone.

Thanks to Aaron and Robert for their patience with me while
explaining things on IRC.

Keir