[RFC] New generic graph storage format

Robert Collins robertc at robertcollins.net
Fri Sep 7 13:25:44 BST 2007


On Thu, 2007-09-06 at 21:16 -0700, Keir Mierle wrote: 
> I've been working on replacing bzrlib.index.GraphIndexBuilder and
> bzrlib.index.GraphIndex with something offering operations which
> are all better than linear for the cases we care about.

Cool. Thank you for taking up this challenge. I may come across as kind
of picky below, but it's simply because this part of the infrastructure
gets a lot of load put on it, with wildly varying constraints - latency,
bandwidth, CPU, memory, branch vs log vs merge vs tree construction.

I've put a copy of one of the converted mozilla branch .tix indices at
http://people.ubuntu.com/~robertc/0.tix. It's 95M or so in size :).

It should provide solid sample data, though it's not in the most recent
version of the current graph index code - it's missing the index record
count header - but that should be easy to tweak. I'm going to reconvert
the moz knit repository to the current format, and when that's done I'll
submit another copy for you.

This index is pathological - there are no merges in it at all.
Typically, though, one in three (IIRC) commits is a merge in modern
histories. It is still useful for considering scaling properties
experimentally, e.g. 'how long to find the knit hunks to read to
reconstruct text XXX over a 300ms SFTP link with 512K per second of
bandwidth', or 'how long to find the knit hunks to read to reconstruct
<set of texts for the entire tree>'.

> First, an introduction. The GraphIndex, as it stands, is used for
> several parts of bzrlib. However, it does not expose an interface
> conforming exactly to the idea of an abstract graph, because the
> GraphIndex 'knows' that a key may be a tuple of subkeys, and it
> also allows multiple reference-lists to be stored.

I think there are two things conflated here. The key being a tuple of
strings is not the same as the key being a tuple of subkeys. If you
think of a composite index in e.g. postgresql, a key in that index is
the entire tuple.

Allowing multiple reference lists was a decision taken because of the
relative size of keys and references - e.g. for the 0.tix I've uploaded:

717921 rows
size of a key reference - 9 bytes
101534630 total bytes
2 key references per node (it's pathological)

So, ballparking, 12M of the 96M is in node references; the rest of each
node is encoding overhead or the key at that position.

Now, the current toy index is not optimally encoded, but a naive split
into multiple indices with one graph each will save about 6M per index
while duplicating the other 84M.
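
For the record, here's the arithmetic behind those ballpark figures as a
quick Python sketch (the row count and byte sizes are just the 0.tix
numbers above; nothing else is assumed):

    MiB = 1024.0 * 1024
    rows = 717921              # nodes in the 0.tix index
    ref_bytes = 9              # bytes per key reference
    refs_per_node = 2          # two reference lists (pathological: no merges)
    total_bytes = 101534630    # on-disk size of 0.tix

    reference_bytes = rows * refs_per_node * ref_bytes
    print(total_bytes / MiB)                      # ~96.8 -> the "96M" index
    print(reference_bytes / MiB)                  # ~12.3 -> "12M" in node references
    print((total_bytes - reference_bytes) / MiB)  # ~84.5 -> "84M" of keys + overhead
    print(rows * ref_bytes / MiB)                 # ~6.2  -> "6M" saved per split index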

> I consider this to be a layering violation, not because of picky
> software engineering reasons, but because having multiple
> adjacency lists inside the single storage space prevents having
> the data spatially coherent. Or rather, only one of the adjacency
> lists can be optimized in this way. If you want to store multiple
> adjacency lists anyway, put the alternate references in the value
> for that node or build a separate GraphIndex.

What I read here is 'because only one of the graphs in the index can be
spatially located, we will only support one graph'. The cause and effect
here do not line up IMO - an inability to optimise both graphs simply
implies that only one graph can be optimised.

Putting the graph into the value will inflate the index size
substantially. So, assuming you are right to force this down to one
graph per index, we will have to solve the spatial inflation costs
somehow, IMO. Commits will write to both the text reconstruction index
and the file-graph index, so more data to write means more overhead.

Now, the knit record hunks are currently only needed for the text
reconstruction index, but I can see log -v using deltas directly in the
future, so we can't assume that the byte-ranges in the .packs will be
redundant.

> You may want to skip the Format section and read my comments below
> first.
> 
> class CompactGraphIndexBuilder(object):
>     """Construct a space-and-time efficient DAG serialization.
> 
>     Given a DAG, this class builds a serialization suitable for rapid querying
>     and traversal over slow network connections, where access is via readv().
> 
>     The graph consists of a series of nodes. Each node contains keys (arbitrary
>     length byte sequences), values (arbitrary length byte sequences), and
>     references to other nodes.  In other words, the graph contains a mapping
>     k->v, and a mapping k->k*.

I would *really* prefer keys to be tuples of byte-sequences, as per the
current graph interface. I started with simple byte-sequences for the
graph index layer and it was too restrictive, with nearly every reuse
having to perform its own encoding/decoding - pushing the composite key
facility into the index made the upper code much cleaner.
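
As a tiny (made-up) illustration of the difference - with plain
byte-string keys every caller ends up inventing an encoding, whereas
tuple keys push that into the index layer:

    # Hypothetical illustration only - not bzrlib API.  With byte-string
    # keys, each caller invents its own composite-key encoding:
    def make_key(file_id, revision_id):
        return '%s\x00%s' % (file_id, revision_id)  # ad-hoc, must be undone on read

    # With tuple-of-byte-sequences keys the index handles composites
    # itself, and callers just pass the natural key:
    key = ('file-id-1234', 'rev-abcdef')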

> The on-disk format has a number of properties which are desirable for the
>     use cases relevant to bzr; specifically, it is compact (keys are compressed
>     in chunks), keys are indexed (there is an index-for-the-index so finding
>     the hash for a key doesn't require reading the whole index), and it is
>     spatially coherent (when traversing a subset of the graph, all successors
>     are likely nearby).
> 
>     The on-disk structure breaks down into three components.
> 
>       (1) The Preamble, which contains metadata about integer sizes and file
>           format version;
>       (2) The TopologicalKeyIndex, which is a mapping of key -> node offset
>           (aka hash) where the offsets correspond to the position of that key's
>           references and data in the Nodes section

When you say 'aka hash' here, what do you mean? Do you mean that you
take an md5/sha1/crc or something, or are you using 'hash' to refer to
the address of the node in the file? If the latter, it's terribly
confusing. Please don't say 'hash' when you mean 'page:byte-offset' or
something like that.

> (3) The Nodes, which contain references to other nodes, some
>           value for this node, and a backpointer to the full key name in the
>           TopologicalKeyIndex. The nodes are ordered by a topological sort
>           (specifically reverse post-order), and identified according to their
>           offset from the start (their node hash). In other words, the node
>           hash is monotonically related to the position of that node in the
>           topological sort.

How are references stored? Are they in the same format as the
backpointer, or are they pointers to the actual nodes?

Seems to me this layout has a couple of potential tradeoffs.
Referring to nodes by pointers back into the TKI makes accessing a
single node:
TKI lookup
read node
TKI lookup for all referenced nodes
done

Iterating just the values, though (which is what text reconstruction
needs - we don't actually need the node names of the elements in the
path, just the values, which contain the things in the pack we need to
read):
TKI lookup on first node
read node
start loop
TKI lookup for references
read nodes
loop

Whereas referring to nodes by pointers to the nodes themselves changes
this:
accessing a single node:
TKI lookup
read node
read referenced nodes
TKI access for the names of those nodes
done.

but iterating just values is now:
TKI lookup on first node
read node
start loop
read referenced nodes
loop
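
To make the tradeoff concrete, here is a rough sketch of the two
value-iteration loops over a toy in-memory model (the dict-based 'tki'
and 'nodes' structures are invented for illustration, not the proposed
disk format):

    # Toy model: 'tki' maps key -> node offset, 'nodes' maps offset ->
    # (value, references).  In variant A the references are TKI
    # backpointers (keys); in variant B they are node offsets.

    def iter_values_via_tki(tki, nodes, start_key):
        """Variant A: references point back into the TKI.

        Every hop costs a TKI lookup *and* a node read.
        """
        pending, seen = [start_key], set()
        while pending:
            key = pending.pop()
            if key in seen:
                continue
            seen.add(key)
            offset = tki[key]                  # TKI lookup
            value, ref_keys = nodes[offset]    # node read
            yield value
            pending.extend(ref_keys)

    def iter_values_direct(nodes, start_offset):
        """Variant B: references point directly at other nodes.

        Once started, the walk never touches the TKI again - but getting
        the *names* of referenced nodes back costs extra TKI access.
        """
        pending, seen = [start_offset], set()
        while pending:
            offset = pending.pop()
            if offset in seen:
                continue
            seen.add(offset)
            value, ref_offsets = nodes[offset]  # node read only
            yield value
            pending.extend(ref_offsets)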


> What does all this mean? Well, the relevant parts as to why this
> is an improvement over the current GraphIndex, is that the keys
> are stored compressed, and are indexed (an index-for-the-index).

GraphIndex today compresses the node references using dictionary
compression. The achieved savings are moderate - the 9 bytes above for
each reference refer to keys that are ~80 bytes long, so roughly a 9:1
compression ratio - though there is lots of duplicate text, as the
sub-elements of the tuple keys are not dictionary compressed at all.

The index-for-the-index would seem to be roughly what would happen if
the existing GraphIndex, rather than starting each node with an encoded
tuple, started with a byte reference to the location of an encoded
standalone tuple. I note this similarity because I considered it for the
current GraphIndex, but discarded it because of the relative size of the
keys and values: keys are >>> values, and the API I was putting together
then had to match the existing knit code completely (I was working on
thunking across in the first instance) and so always needed to be able
to return the keys - so having the keys separate didn't help. If we can
eliminate needing to return keys for some queries, then we can look at
packing more values into a small area and get locality-of-reference wins
there, which I think your layout will do.


> The TopologicalKeyIndex is a series of compressed blocks,
> corresponding to blocks from the list of (key, hash) pairs sorted
> lexicographically by key. Then there is a header in the
> TopologicalKeyIndex, which is also compressed, which lists the
> first key in each block, along with the byte offset of that block
> in the file.

I'm concerned about the use of compression. It may be that it's needed,
but I'd like to see some benchmarks and tests to let us make educated
decisions. Compression trades off CPU for bandwidth, but as the index is
hopefully structured to have no redundant data, one would hope there is
little room for compression to win.

> Since the header is small we download it in one go.

This seems entirely unjustified. Say there are 700,000 80-byte keys:
56,000,000 bytes. If each 'block' is 4000 bytes in size, we have 14,000
blocks of TKI, and that means (80 + a byte offset) * 14000 bytes, or
roughly a megabyte of data.

Now I admit that's a lot less than 96M, but still, downloading 1M to
perform a trivial query is not necessarily good, and probably bad.
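
Spelling the numbers out (700,000 keys, 80-byte keys and a 4000-byte
block size are the figures assumed in the paragraph above; the 8-byte
offset is my assumption):

    keys = 700000
    key_bytes = 80
    block_size = 4000
    offset_bytes = 8                  # assume an 8-byte offset per block entry

    tki_bytes = keys * key_bytes      # 56,000,000 bytes of TKI
    blocks = tki_bytes // block_size  # 14,000 blocks
    print((key_bytes + offset_bytes) * blocks)  # 1,232,000 -> roughly 1M of header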

I would paraphrase your design here as having a sparse index for the
TKI. An alternative approach might be to use an LPC trie for the index
of TKI blocks. This would not identify specific keys - it would only
have enough selectivity to identify the block a TKI entry would be found
in. I suspect this trie would be about the same size as the
LZW-compressed index. Working with it would be much the same, except
that rather than a linear scan through the list of blocks, it would be a
trie traversal, and we can expect the first 4K to give selectivity to at
least 1/50th of the keyspace (4000/80); when that isn't sufficient we
will be jumping to another 4K which gives as much incremental
improvement again. 700000 keys can be selected from in 4 4K trie page
reads (and for remote indices we'd want to read more than 4K at a time,
or even structure it with a greater page size, though that will impact
local disk). (By 'page' here I mean a unit of data that we process, not
necessarily implying that there will be wasted space or a requirement
that we parse everything in a page.)

Another alternative approach would be to use a sparse B+-like tree,
which also has the property of pointing to blocks rather than individual
records, and can be serialised breadth-first, giving forward-read
processing where the more we read up-front the more selectivity we get,
without requiring a read of the entire thing.
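
Whichever structure we pick (sparse header, LPC trie, or B+-like tree),
the key property is that it only needs to pick a *block*, never an exact
record. A minimal sketch of that idea, using a sorted list of
first-key-in-block values and bisection (names and data invented purely
for illustration):

    import bisect

    def find_block(first_keys, offsets, key):
        """Return the byte offset of the TKI block that could hold 'key'.

        'first_keys' is the sorted list of first-key-in-block values and
        'offsets' the matching block byte offsets - a stand-in for the
        trie or B+-tree variants above.  It never identifies an exact
        key, only which block to readv() next.
        """
        i = bisect.bisect_right(first_keys, key) - 1
        return offsets[max(i, 0)]

    # Example with made-up keys and offsets:
    first_keys = ['aaa-file-id', 'mmm-file-id', 'ttt-file-id']
    offsets = [0, 4096, 8192]
    print(find_block(first_keys, offsets, 'pqr-file-id'))   # -> 4096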

> Then, when finding the hash for a key in the TopologicalKeyIndex,
> we have to read only one other block.
> 
> Within a single block, the sorted key/hash pairs are stored as
> fixed-size rows to make searching for a key which we know is
> present fast, and not require an additional indexing step. This is
> sensible when there are many keys with very similar lengths, which
> is the case for bzr. For the Python implementation, I will likely
> put the keys/hashes into a dictionary, but for an efficient C
> version we can directly binary search the bytes because they are
> sorted.

We profiled doing bisection in Python, and while I don't know how many
records one needs, it is faster to avoid parsing all the data we do not
need up front. Just a note :). Profiling, as always, is the canonical
answer.
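
For what it's worth, fixed-size rows do make it possible to binary
search the raw block bytes without parsing any row the search doesn't
touch - in C or in Python. A rough sketch (the row layout here is
invented for illustration, not your proposed format):

    def search_block(block, key, key_width, row_width):
        """Binary search a block of fixed-size rows for 'key'.

        Each row is 'row_width' bytes: a NUL-padded 'key_width'-byte key
        followed by the value/offset bytes.  Only the rows the search
        actually probes are ever examined.
        """
        padded = key.ljust(key_width, b'\x00')
        lo, hi = 0, len(block) // row_width
        while lo < hi:
            mid = (lo + hi) // 2
            row_key = block[mid * row_width:mid * row_width + key_width]
            if row_key < padded:
                lo = mid + 1
            elif row_key > padded:
                hi = mid
            else:
                row = block[mid * row_width:(mid + 1) * row_width]
                return row[key_width:]   # the hash/offset part of the row
        return None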

> Questions:
>  * Use LZW for the blocks? With 5k of sorted keys, it compresses
>    to about 1k using zlib.compress().

Which keys are you testing this with? 5:1 is certainly nice to have.

> * What about radix trees for the blocks instead? That way there
>    would be no decompression step because the tree would already
>    have high entropy. However, backpointers from the node storage
>    into the key index would be trickier to do.

I think your file description might be clearer if you said there are 4
things:
preamble
key->TKI block mapping
TKI blocks
value blocks

To me this makes it clearer where we can tweak and optimise the design
(the TKI blocks can remain unaltered while we change the key->TKI block
mapping).

> * What is the best sorting? Can we optimize the case where we
>    want checkouts to be fast, (which is a linear traversal) but we
>    don't care about revisions after unless we are doing log; in
>    the checkout case we'd want to store some other recent
>    information there (Thanks to Aaron for pointing this out)

These cases are in different graphs; we probably cannot optimise both
in a single index, but there are options - we can do multiple indices
(high cost), we could do multiple value blocks (less duplication, but
only one graph accessible at a time; kind of like two-indices-in-one),
or we could optimise only one graph for locality.

>  * Perhaps use length + compressed XML hunk for the preamble
>    instead of the struct-like representation above.

Please no. We haven't added XML anywhere for quite some time.

> That's all for now; I'm going to proceed and implement the above.
> Comments are appreciated, nothing is set in stone.
> 
> Thanks to Aaron and Robert for their patience with me while
> explaining things on IRC.

Happy to discuss, and I'm really glad to see you've put serious thought
into this design. 

A few more notes - there's no need to give this its own signature and
class names - just replace the toy index I wrote. The toy is documented
as being unstable and to-be-replaced; its disk format has already
changed several times since it was merged to mainline.

I think a useful thing to do before all the code is written is some
modelling: how big will an index be given sample data (e.g. bzr.dev's
knits)? How many separate read operations are needed to perform e.g.
checkout, or commit, with this index? And so on. Instrumenting the
current index might be a good way to find out the access patterns that
are made. I suggest this because it takes time to write and tune code;
some modelling can help validate the design before getting to the code.
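
One (hypothetical) way to get those access patterns: wrap the current
index object in a proxy that counts every call, then run checkout,
commit, log etc. against it. Something along these lines - a generic
delegating wrapper, not actual bzrlib API:

    import collections

    class InstrumentedIndex(object):
        """Delegate to a real index object, counting every method call.

        Purely illustrative - wrap whatever the current GraphIndex
        exposes and exercise it with real operations to see how many
        lookups each one makes.
        """

        def __init__(self, real_index):
            self._real = real_index
            self.calls = collections.defaultdict(int)

        def __getattr__(self, name):
            attr = getattr(self._real, name)
            if not callable(attr):
                return attr
            def counted(*args, **kwargs):
                self.calls[name] += 1
                return attr(*args, **kwargs)
            return counted

        def report(self):
            for name, count in sorted(self.calls.items()):
                print('%s: %d calls' % (name, count))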


-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.