[RFC] New generic graph storage format
Keir Mierle
mierle at gmail.com
Fri Sep 7 05:16:33 BST 2007
I've been working on replacing bzrlib.index.GraphIndexBuilder and
bzrlib.index.GraphIndex with something offering operations which
are all better than linear for the cases we care about.
First, an introduction. The GraphIndex, as it stands, is used for
several parts of bzrlib. However, it does not expose an interface
conforming exactly to the idea of an abstract graph, because the
GraphIndex 'knows' that a key may be a tuple of subkeys, and it
also allows multiple reference-lists to be stored.
I consider this to be a layering violation, not because of picky
software engineering reasons, but because having multiple
adjacency lists inside the single storage space prevents having
the data spatially coherent. Or rather, only one of the adjacency
lists can be optimized in this way. If you want to store multiple
adjacency lists anyway, put the alternate references in the value
for that node or build a separate GraphIndex.
So, what I've started building is a generic write-once graph data
structure designed for fast access locally and across the network
where only readv() is provided. Below is the (DRAFT!) docstring
for the CompactGraphIndex, which describes the format. After the
docstring I include some discussion around why I chose to
structure the CompactGraphIndex this way.
You may want to skip the Format section and read my comments below
first.
class CompactGraphIndexBuilder(object):
"""Construct a space-and-time efficient DAG serialization.
Given a DAG, this class builds a serialization suitable for rapid querying
and traversal over slow network connections, where access is via readv().
The graph consists of a series of nodes. Each node contains keys (arbitrary
length byte sequences), values (arbitrary length byte sequences), and
references to other nodes. In other words, the graph contains a mapping
k->v, and a mapping k->k*.
The on-disk format has a number of properties which are desireable for the
usecases relevant to bzr; specifically, it is compact (keys are compressed
in chunks), keys are indexed (there is an index-for-the-index so finding
the hash for a key doesn't require reading the whole index), and it is
spatially coherent (when traversing a subset of the graph, all successors
are likely nearby).
The on-disk structure breaks down into three components.
(1) The Preamble, which contains metadata about integer sizes and file
format version;
(2) The TopologicalKeyIndex, which is a mapping of key -> node offset
(aka hash) where the offsets correspond to the position of that key's
references and data in the Nodes section
(3) The Nodes, which contain references to other nodes, some
value for this node, and a backpointer to the full key name in the
TopologicalKeyIndex. The nodes are ordered by a topological sort
(specifically reverse post-order), and identified according to their
offset from the start (their node hash). In other words, the node
hash is monotonically related to the position of that node in the
topological sort.
The Format
==========
GraphIndex := Preamble TopologicalKeyIndex Nodes
Preamble := Magic NodesOffset NodeRefSize NodeRefCountSize
MaxKeyLengthSize KeyLengthSize IndexBlockRefSize
ValueLengthSize PreamblePadding
Magic := 'Bazaar Compact Graph 1' NEWLINE
NodesOffset := UINT64 file offset of Nodes (&Nodes)
NodeRefSize := UINT8 Size in bytes of an node reference.
NodeRefCountSize := UINT8 Size of the number of references in a node
MaxKeyLengthSize := UINT8 Size of the max length of a key in an IndexBlock
KeyLengthSize := UINT8 Size of the length of a key in an IndexBlock
IndexBlockRefSize := UINT8 Size of an offset to an IndexBlock
ValueLengthSize := UINT8 Size of a value length
IndexBlockRef := UINT(IndexBlockRefSize)
PreamblePadding := bytes[64 - len(Preamble)]
TopologicalKeyIndex := CompressedHeaderIndexSize CompressedHeaderIndex
CompressedIndexBlock*
CompressedHeaderIndexSize := UINT32
CompressedHeaderIndex := LZW(HeaderIndex)
HeaderIndex := MaxKeyLength KeyIndexBlockRefPair*
KeyIndexBlockRefPair := Key KeyIndexPairPadding IndexBlockRef
KeyIndexPairPadding := bytes[MaxKeyLength - KeyLength]
CompressedIndexBlock := LZW(IndexBlock)
IndexBlock := MaxKeyLength KeyNodeRefPair*
MaxKeyLength := UINT(MaxKeyLengthSize)
KeyNodeRefPair := Key KeyPadding NodeRef
Key := KeyLength KeyData
KeyLength := UINT(KeyLengthSize)
KeyData := bytes[KeyLength]
KeyPadding := NULL[MaxKeyLength - KeyLength]
Nodes := Node*
Node := NodeReferences KeyReference Value
NodeReferences := NumRefs NodeReference{NumRefs}
NumRefs := UINT(NodeRefCountSize)
NodeReference := UINT(NodeRefSize)
This is a byte offset from &Nodes
KeyReference := IndexBlockRef
Value := ValueLength ValueData
ValueLength := UINT(ValueLengthSize)
ValueData := bytes[ValueLength]
Notes:
* Integers are stored in LITTLE ENDIAN for the practical reason that
most computers are Intel.
* All sizes are in bytes
"""
pass
What does all this mean? Well, the relevant parts as to why this
is an improvement over the current GraphIndex, is that the keys
are stored compressed, and are indexed (an index-for-the-index).
The TopologicalKeyIndex is a series of compressed blocks,
corresponding to blocks from the list of (key, hash) pairs sorted
lexiographically by key. Then there is a header in the
TopologicalKeyIndex, which is also compressed, which lists the
first key in each block, along with the byte offset of that block
in the file. Since the header is small we download it in one go.
Then, when finding the hash for a key in the TopologicalKeyIndex,
we have to read only one other block.
Within a single block, the sorted keys/hash pairs is stored as
fixed-size rows to make searching for a key which we know is
present fast, and not require an additional indexing step. This is
sensible when there are many keys with very similar lengths, which
is the case for bzr. For the Python implementation, I will likely
put the keys/hashes into a dictionary, but for an efficient C
version we can directly binary search the bytes because they are
sorted.
Questions:
* Use LZW for the blocks? With 5k of sorted keys, it compresses
to about 1k using zlib.compress().
* What about radix trees for the blocks instead? That way there
would be no decompression step because the tree would already
have high entropy. However, backpointers from the node storage
into the key index would be trickier to do.
* What is the best sorting? Can we optimize the case where we
want checkouts to be fast, (which is a linear traversal) but we
don't care about revisions after unless we are doing log; in
the checkout case we'd want to store some other recent
information there (Thanks to Aaron for pointing this out)
* Perhaps use length + compressed XML hunk for the preamble
instead of the struct-like representation above.
That's all for now; I'm going to proceed and implement the above.
Comments are appreciated, nothing is set in stone.
Thanks to Aaron and Robert for their patience with me while
explaining things on IRC.
Keir
More information about the bazaar
mailing list