[RFC] New generic graph storage format
Keir Mierle
mierle at gmail.com
Fri Sep 7 22:58:38 BST 2007
Get a coffee; you'll be done with it before you're done with this message.
On 9/7/07, Robert Collins <robertc at robertcollins.net> wrote:
> On Thu, 2007-09-06 at 21:16 -0700, Keir Mierle wrote:
> > I've been working on replacing bzrlib.index.GraphIndexBuilder and
> > bzrlib.index.GraphIndex with something offering operations which
> > are all better than linear for the cases we care about.
>
> Cool. Thank you for taking up this challenge. I may come across as
> kind of picky below, but it's simply because this part of the
> infrastructure gets a lot of load put on it with wildly varying
> constraints - latency, bandwidth, cpu, memory, branch vs log vs merge vs
> tree construction.
Absolutely; I posted my not-entirely-baked design looking for this sort of
discussion.
> I've put a copy of one of the converted mozilla branch .tix indices at
> http://people.ubuntu.com/~robertc/0.tix. It's 95M or so in size :).
>
> It should provide solid sample data, though it's not in the most recent
> version of the current graph index code - it's missing the index record
> count header - but that should be easy to tweak. I'm going to reconvert
> the moz knit repository to the current format, and when that's done I'll
> submit another copy for you.
>
> This index is pathological - there are no merges in it at all. Typically
> though 1 in three (IIRC) commits are merges in modern histories, but it
> is still useful for considering scaling properties experimentally. e.g.
> 'how long to find the knit hunks to read to reconstruct text XXX over a
> 300ms SFTP link with 512K per second bandwidth.', or 'how long to find
> the knit hunks to read to reconstruct <set of texts for the entire
> tree>'.
I'll do some experiments with it.
> > First, an introduction. The GraphIndex, as it stands, is used for
> > several parts of bzrlib. However, it does not expose an interface
> > conforming exactly to the idea of an abstract graph, because the
> > GraphIndex 'knows' that a key may be a tuple of subkeys, and it
> > also allows multiple reference-lists to be stored.
>
> I think there are two things conflated here. The key being a tuple of
> strings is not the same as the key being a tuple of subkeys. If you
> think of a composite index in e.g. postgresql, a key in that index is
> the entire tuple.
I am not planning on changing the current GraphIndex interface; I
am planning on building it with my CompactGraph class.
Unless we store the raw string keys (rather than the tuples) in a
separate place (allowing us to store the tuples as (keyID1,
keyID2) where the mapping key->keyID is stored separately), I see
no reason to support tuples of keys in the layer which does
serialization. It's still just k->v, k->k*. We can build tuples on
top of that.
I think the design will be simpler and clearer this way.
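To make that layering concrete, here is a rough sketch (all names here are hypothetical, not a proposed API) of how tuple keys could sit on top of a serializer that only knows flat byte-string keys:

    # Hypothetical sketch: tuple keys layered on top of a flat
    # byte-string key store.  Assumes '\x00' never appears inside a
    # key element.
    class TupleKeyAdapter(object):

        def __init__(self, flat_index):
            # flat_index exposes add_node/get over plain string keys.
            self._flat = flat_index

        def _encode(self, key_tuple):
            return '\x00'.join(key_tuple)

        def _decode(self, flat_key):
            return tuple(flat_key.split('\x00'))

        def add_node(self, key_tuple, value, references):
            flat_refs = [self._encode(ref) for ref in references]
            self._flat.add_node(self._encode(key_tuple), value, flat_refs)

        def get(self, key_tuple):
            value, flat_refs = self._flat.get(self._encode(key_tuple))
            return value, [self._decode(ref) for ref in flat_refs]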
> Allowing multiple reference lists was a decision taken because of the
> relative size of keys and references - e.g. for the 0.tix I've uploaded.
>
> 717921 rows
> size of a key reference - 9 bytes
> 101534630 total bytes
> 2 key references per node (it's pathological).
> So, ballparking: 12M/96M is in node references; the rest of each
> node is encoding overhead or the key at that position.
>
> Now, the current toy index is not optimally encoded, but a naive split
> into multiple indices with one graph each would save 6M per index while
> duplicating 84M.
Assuming it's not compressed, which I am convinced (though not yet
benchmark-convinced) we should do. The keys are highly redundant.
More on this later.
> > I consider this to be a layering violation, not because of picky
> > software engineering reasons, but because having multiple
> > adjacency lists inside the single storage space prevents having
> > the data spatially coherent. Or rather, only one of the adjacency
> > lists can be optimized in this way. If you want to store multiple
> > adjacency lists anyway, put the alternate references in the value
> > for that node or build a separate GraphIndex.
>
> What I read here is 'because only one of the graphs in the index can be
> spatially located, we will only support one graph'. The cause and effect
> here do not line up IMO - an inability to optimise both graphs simply
> implies that only one graph can be optimised.
I buy that.
> Putting the graph into the value will inflate the index size
> substantially.
> So assuming you are right to force this down to one graph
> per index, we will have to solve the spatial inflation costs somehow
> IMO. Commits will write to both the text reconstruction index and the
> file-graph index so more data to write is more overhead.
I've thought of two ways to solve this. The first and most obvious
is what you're already doing, storing the second list of
references (as byte-offset-in-file).
However, I have a different suggestion, which may or may not be
faster: store multiple node offsets for a single key in the TKI,
with a separate nodes section for each graph, so that each offset
in a key's TKI row corresponds to a different graph. For example,
the key 'foo' may have node offsets 23 and 43, where 23 is the
position of that node's data (value + refs) in the first graph's
section and 43 is its position in the second. The file would then
take the structure:
    preamble
    TKI (containing key -> (offset_in_graph1, offset_in_graph2))
    graph1
    graph2
Whether this is a win or not will depend somewhat on the access
patterns and the data being stored. In the preamble, I can also
imagine saying 'graph 2 has only references and no data' or 'graph
4 has only backrefs into the TKI' to avoid storing the extra byte
for empty values.
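To illustrate (the numbers and names below are made up): resolving a key against such a file only needs the per-graph section start from the preamble plus the key's row of offsets from the TKI:

    # Illustrative sketch of the multiple-offsets-per-key layout.
    graph_starts = {1: 4096, 2: 1052672}        # from the preamble
    tki = {'foo': (23, 43), 'bar': (87, 101)}   # key -> offset per graph

    def node_position(key, graph_number):
        # Absolute byte position of this key's (value + refs) data
        # within the requested graph's section.
        offsets = tki[key]
        return graph_starts[graph_number] + offsets[graph_number - 1]

    # 'foo' in graph 1 lives at 4096 + 23; in graph 2 at 1052672 + 43.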
Clarification:
Before, I was calling the 'hash' of the key the offset of
that node's data in the node section. I apologize if that was
confusing. I think of a 'hash' as any mapping which shrinks the
key to some fixed size, even though that is not the strict
definition; so I consider the mapping key -> position of the key's
data in the node section to be a hash. Let's call this the 'node
offset' from now on. To summarize, the TKI (topological key index)
stores a mapping key -> node offset.
Back to our main programming.
I did some experiments to see how much of the keyspace is shared
among the .iix, .rix, .six, and .tix files. I ignored the .tix
file, then calculated the size of the union of the keys and how
much smaller the result would be if only the union was stored.
I also calculated the overhead which would be incurred, assuming
a fixed-width row layout for the blocks in the TKI (because blanks
would have to be stored).
Below are my results for the files found at:
http://people.ubuntu.com/~robertc/baz2.0/.bzr/repository/indices/
Size of the union of the keys of iix, rix, and six for each file.
The union/sum column is len(union) / sum of len(keys in [irs]ix);
neat, there's roughly a 60% savings by storing only the union of
the keys. The blank columns, len(union - Xix)/len(union), say what
fraction of the column associated with that particular graph would
be blank.

    len(union)  union/sum        blank iix          blank rix  blank six
             3  0.375            0.0                0.0        0.333333333333
            70  0.454545454545   0.0                0.0        0.8
           207  0.456953642384   0.0                0.0        0.811594202899
             8  0.444444444444   0.0                0.0        0.75
            79  0.454022988506   0.0                0.0        0.79746835443
         13391  0.421830209482   0.000149354043761  0.0        0.629228586364
            42  0.428571428571   0.0                0.0        0.666666666667
             4  0.4              0.0                0.0        0.5
In summary: it appears there is a clear space gain by merging iix
and rix. With .six it is not so clear; also, .six is merely a
dictionary and not really a graph at all.
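For reference, the measurement behind those numbers is just set arithmetic over the key sets; roughly this (a sketch, assuming the keys have already been read out of each index):

    def union_stats(iix_keys, rix_keys, six_keys):
        # Each argument is the set of keys found in that index file.
        union = iix_keys | rix_keys | six_keys
        total = len(iix_keys) + len(rix_keys) + len(six_keys)
        stats = {'len union': len(union),
                 'union/sum': len(union) / float(total)}
        for name, keys in [('iix', iix_keys), ('rix', rix_keys),
                           ('six', six_keys)]:
            # Fraction of TKI rows whose column for this graph is blank.
            stats['blank ' + name] = len(union - keys) / float(len(union))
        return stats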
What I propose is to make a separate class for the TKI, called
simply Dictionary. Then the TKI is built using a
DictionaryBuilder. DictionaryBuilder will be a dictionary
serializer optimized for our use case (sparse index, compressed
keys in blocks, etc). Furthermore, it will make swapping out the
TKI from the GraphIndex simple.
Then the .six file ceases to be a GraphIndex (which it isn't) and
becomes a Dictionary. Or maybe Index. I'm not sure what to call
it.
Then instead of 4 files, we have 3 files:
    (iix, rix) -> ix containing 3 graphs
    six        -> sig containing a Dictionary
    tix        -> tix containing 2 graphs
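As a rough sketch of the DictionaryBuilder side of that proposal (names and block size are provisional, not a final API): sort the keys, chunk them into blocks, compress each block, and keep a small header of (first key, block offset) pairs:

    import zlib

    class DictionaryBuilder(object):
        """Provisional sketch of a dictionary serializer: sorted keys,
        compressed in blocks, with a sparse header for block lookup."""

        def __init__(self, keys_per_block=256):
            self._entries = {}
            self._keys_per_block = keys_per_block

        def add(self, key, value):
            self._entries[key] = value

        def finish(self):
            keys = sorted(self._entries)
            header, blocks, offset = [], [], 0
            for i in range(0, len(keys), self._keys_per_block):
                chunk = keys[i:i + self._keys_per_block]
                raw = ''.join('%s\x00%s\n' % (k, self._entries[k])
                              for k in chunk)
                block = zlib.compress(raw)
                header.append((chunk[0], offset))
                blocks.append(block)
                offset += len(block)
            return header, blocks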
> Now, the knit record hunks are currently only needed for the text
> reconstruction index, but I can see log -v using deltas directly in the
> future so we can't assume that the byte-ranges in the .packs will be
> redundant.
>
> > You may want to skip the Format section and read my comments below
> > first.
> >
> > class CompactGraphIndexBuilder(object):
> > """Construct a space-and-time efficient DAG serialization.
> >
> > Given a DAG, this class builds a serialization suitable for rapid querying
> > and traversal over slow network connections, where access is via readv().
> >
> > The graph consists of a series of nodes. Each node contains keys (arbitrary
> > length byte sequences), values (arbitrary length byte sequences), and
> > references to other nodes. In other words, the graph contains a mapping
> > k->v, and a mapping k->k*.
>
> I would *really* prefer keys need to be tuples of byte-sequences as per
> the current graph interface. I started with simple byte-sequences for
> the graph index layer and it was too restrictive, with nearly every
> reuse having to perform its own encoding/decoding - pushing the
> composite key facility into the index made upper code much cleaner.
See above.
> > The on-disk format has a number of properties which are desirable for the
> > usecases relevant to bzr; specifically, it is compact (keys are compressed
> > in chunks), keys are indexed (there is an index-for-the-index so finding
> > the hash for a key doesn't require reading the whole index), and it is
> > spatially coherent (when traversing a subset of the graph, all successors
> > are likely nearby).
> >
> > The on-disk structure breaks down into three components.
> >
> > (1) The Preamble, which contains metadata about integer sizes and file
> > format version;
> > (2) The TopologicalKeyIndex, which is a mapping of key -> node offset
> > (aka hash) where the offsets correspond to the position of that key's
> > references and data in the Nodes section
>
> when you say aka hash here, what do you mean? do you mean that you take
> a md5/sha1/crc or something, or are you using 'hash' to refer to the
> address of the node in the file? If the latter, it's terribly confusing.
> Please don't say 'hash' when you mean 'page:byte-offset' or something
> like that.
As above, I am using 'hash' to mean the offset of that key's data
(aka value and references) in the nodes section. I'll use 'node
offset' from now on.
> > (3) The Nodes, which contain references to other nodes, some
> > value for this node, and a backpointer to the full key name in the
> > TopologicalKeyIndex. The nodes are ordered by a topological sort
> > (specifically reverse post-order), and identified according to their
> > offset from the start (their node hash). In other words, the node
> > hash is monotonically related to the position of that node in the
> > topological sort.
>
> How are references stored? Are they in the same format as the
> backpointer, or are they pointers to the actual nodes?
The references are stored as node offsets, such that you do not
need to go back to the TKI to traverse if you don't care about the
actual keys, just their values and references.
> Seems to me this layout has a couple of potential tradeoffs.
Yes, so many tradeoffs!
> Referring to nodes by pointers back to the TKI will allow iterating a
> single node to be:
> TKI lookup
> read node
> TKI lookup for all referenced nodes
> done
>
> Iterating just the values though (which is what text reconstruction
> needs - we don't actually need the node names of the elements in the
> path, just the values which contain the things in the pack we need to
> read):
> TKI lookup on first node
> read node
> start loop
> TKI lookup for references
> read nodes
> loop
Slow for checkout, merge, etc.
> Whereas referring to nodes by pointers to the nodes themselves changes
> this:
> accessing a single node:
> TKI lookup
> read node
> read referenced nodes
> TKI access for the name of those nodes
> done.
>
> but iterating just values is now:
> TKI lookup on first node
> read node
> start loop
> read referenced nodes
> loop
This (that if you don't care about the actual keys, you can avoid
ever knowing them or looking them up) is exactly what I intended
to describe, though I was not clear.
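A hedged sketch of that value-only walk (the helper names are invented): one TKI lookup at the start, then pure node-offset chasing:

    def iter_values(start_key, tki_lookup, read_node):
        # tki_lookup: key -> node offset (one TKI block read at most).
        # read_node: node offset -> (value, list of referenced offsets).
        pending = [tki_lookup(start_key)]
        seen = set()
        while pending:
            offset = pending.pop()
            if offset in seen:
                continue
            seen.add(offset)
            value, ref_offsets = read_node(offset)
            yield value
            # References are node offsets, so no further TKI lookups.
            pending.extend(ref_offsets)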
> > What does all this mean? Well, the relevant parts as to why this
> > is an improvement over the current GraphIndex, is that the keys
> > are stored compressed, and are indexed (an index-for-the-index).
>
> GraphIndex today compresses the node references using dictionary
> compression. The achieved savings are moderate - the 9 bytes above for
> each reference are referring to keys that are ~80 bytes long - so a 9:1
> compression ratio - though there is lots of duplicate text as the sub
> elements of the tuple keys are not dictionary compressed at all.
First, let's be clear about what dictionary compression is. I
think of LZW as dictionary compression; however, I think you mean
storing the position of something in a dictionary rather than
storing the thing itself.
Given that, I am doing the same, except storing the position as a
binary unsigned integer of either 1, 2, or 4 bytes, depending on
how big the nodes section is.
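Concretely, something along these lines (a sketch, not the final encoding):

    import struct

    # Sketch: pick the narrowest unsigned integer wide enough to
    # address any byte in the nodes section, then use it for every
    # node offset in the file.
    def offset_format(nodes_section_length):
        if nodes_section_length < (1 << 8):
            return '>B'   # 1 byte
        elif nodes_section_length < (1 << 16):
            return '>H'   # 2 bytes
        else:
            return '>I'   # 4 bytes

    def pack_offsets(offsets, fmt):
        return ''.join(struct.pack(fmt, offset) for offset in offsets)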
> The index-for-the index would seem to be roughly what would happen if
> the existing GraphIndex, rather than starting each node with an encoded
> tuple started with a bytereference to the location of an encoded
> standalone tuple. I note this similarity because I considered this for
> the current GraphIndex, but discarded it because the relative size of
> the keys and values: keys are >>> value and as the API I was putting
> together then, which needed to match the existing knit code completely
> (I was working on thunking across in the first instance) needed to be
> able to always return the keys, so having the keys separate didn't
> help... If we can eliminate needing to return keys for some queries then
> we can look at packing more values into a small area and get locality of
> reference wins there, which I think your layout will do.
Yes, there will be performance hits until we can eliminate
unnecessary key lookups.
> > The TopologicalKeyIndex is a series of compressed blocks,
> > corresponding to blocks from the list of (key, hash) pairs sorted
> > lexicographically by key. Then there is a header in the
> > TopologicalKeyIndex, which is also compressed, which lists the
> > first key in each block, along with the byte offset of that block
> > in the file.
>
> I'm concerned about the use of compression. It may be that its needed,
> but I'd like to see some benchmarks and tests to let us make educated
> decisions. Compression trades off CPU for bandwidth, but as the index is
> hopefully structured to have no redundant data one would hope there is
> little room for compression to win.
True; however, decompression has long since been faster than disk
or network. On my laptop, sequential disk reads top out at ~30
MB/s, and on my desktop at 65 MB/s. With Python's zlib, my laptop
can decompress at 75 MB/s and my desktop at 150 MB/s. By
'decompress' I mean: given compressed data, produce decompressed
output at a rate of X MB/s (measured on the decompressed bytes).
Assuming 5:1 compression, a 100 MB file compresses to 20 MB.
Reading 20 MB takes ~0.7 seconds on my laptop and ~0.35 seconds on
my desktop; decompression takes ~1.3s on my laptop and ~0.7s on my
desktop. Totals for a compressed read: ~2s laptop, ~1s desktop.
Reading the uncompressed 100 MB sequentially takes ~3s on my
laptop and ~1.5s on my desktop, so the plain sequential read takes
about 50% longer than reading compressed data and decompressing
it.
Things are a bit murkier when parts of the file may already be
hanging around in the system's caches. However, for network
operations, compression is a clear win. Whether compressing flat
list blocks is better than alternate packing strategies (tries,
btree) is still up for debate.
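For what it's worth, the decompression rate quoted above is measured along these lines (a rough sketch, not a rigorous benchmark):

    import time
    import zlib

    def decompress_rate(compressed_bytes):
        # MB/s of decompressed output produced from already-read
        # compressed data; ignores the cost of reading it.
        start = time.time()
        data = zlib.decompress(compressed_bytes)
        elapsed = time.time() - start
        return (len(data) / 1e6) / elapsed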
> > Since the header is small we download it in one go.
>
> This seems entirely unjustified. Say there are 700,000 80-byte keys.
> 56,000,000 bytes. If each 'block' is 4000 bytes in size, we have 14000
> blocks of TKI, and that means (80 + a byte offset) * 14000 bytes, or
> roughly a megabyte of data.
>
> Now I admit thats a lot less than 96M, but still, downloading 1M to
> perform a trivial query is not necessarily good, and probably bad.
The plan was to have several knobs to keep the index small. But I
agree, a staged TKI index is best.
> I would paraphrase your design here as having a sparse index for the
> TKI. An alternative approach might be to use an LPC trie for the index
> of TKI blocks. This would not identify specific keys - it would only
> have enough selectivity to identify the block a TKI entry would be found
> in. I suspect this trie would be about the same size as the LZW
> compressed index.
> Working with it would be much the same, except that
> rather a linear scan through the list of blocks,
Hold on! I never said anything about a linear scan. It would be a
standard binary search; remember, I am *forcing* the records to be
fixed-width per block and in the TKI index, so we never have to
look at all the data.
> it would be a trie
> traversal, and we can expect the first 4K to give at least selectivity
> to 1/50th of the keyspace (4000/80)
Compressed 5:1, this gets you to 4000/16 -> 1/250th. I was
thinking of letting the index part be as big as 32k; at 32k we
have 1/2000th selectivity. That's 350 keys per selected block, and
at 16 bytes/key, 5600 bytes per block, which seems entirely
reasonable.
Then to retrieve any key, it's a sequential 32k read followed by a
4k read. Also, we then have the whole TKI sparse index in memory,
so further lookups will be fast, requiring only a single 4k read.
I'm not saying we have to do it this way, just that it's not as
crazy as it sounds.
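A sketch of that lookup path (block size, key width, and helper names are illustrative only):

    import bisect
    import zlib

    def lookup(key, sparse_index, read_block, key_width=80, row_width=96):
        # sparse_index: sorted list of (first_key, block_offset, block_len),
        # read in one go (the ~32k sequential read).  One more block read
        # plus an in-memory binary search over fixed-width rows finds the
        # key's node offsets.
        first_keys = [entry[0] for entry in sparse_index]
        i = bisect.bisect_right(first_keys, key) - 1
        _, block_offset, block_len = sparse_index[i]
        rows = zlib.decompress(read_block(block_offset, block_len))
        lo, hi = 0, len(rows) // row_width
        while lo < hi:
            mid = (lo + hi) // 2
            row = rows[mid * row_width:(mid + 1) * row_width]
            if row[:key_width].rstrip('\x00') < key:
                lo = mid + 1
            else:
                hi = mid
        return rows[lo * row_width:(lo + 1) * row_width]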
> will be jumping to another 4K which gives as much incremental
> improvement again. 700000 keys can be selected from in 4 4K trie page
> reads (and for remote indices we'd want to read more than 4K at a time,
> or even structure it with a greater page size though that will impact
> local disk). (By page here I mean a unit of data that we process, not
> necessarily implying that there will be wasted space or a requirement
> that we parse everything in a page).
>
> Another alternative approach would be to use a sparse B+ like tree ,
> which also has the property of pointing to blocks not individual records
> and can be serialised breadth first giving forward-read processing where
> the more read up-front the more selectivity we get, without requiring
> reading of the entire thing.
My approach stores only pointers to blocks in the sparse index. B+
trees could work too.
Separating the TKI into a new Dictionary class will make it easy
to experiment with new approaches, if we can all agree on the
two-part GraphIndex structure of TKI and graph data :)
> > Then, when finding the hash for a key in the TopologicalKeyIndex,
> > we have to read only one other block.
> >
> > Within a single block, the sorted keys/hash pairs is stored as
> > fixed-size rows to make searching for a key which we know is
> > present fast, and not require an additional indexing step. This is
> > sensible when there are many keys with very similar lengths, which
> > is the case for bzr. For the Python implementation, I will likely
> > put the keys/hashes into a dictionary, but for an efficient C
> > version we can directly binary search the bytes because they are
> > sorted.
>
> We profiled doing bisection in python and while I don't know how many
> records one needs, it is faster to avoid parsing all the data we do not
> need up front. Just a note :). Profiling as always is the canonical
> answer.
Was bisection in Python reasonably fast?
> > Questions:
> > * Use LZW for the blocks? With 5k of sorted keys, it compresses
> > to about 1k using zlib.compress().
>
> Which keys are you testing this with ? 5:1 is certainly nice to have.
The ones in your repository (mentioned in my experiments above)
and the mozilla .tix. In all cases the keys compress to within 1%
of 20% of their original size.
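That figure is easy to reproduce; the measurement is essentially this (a sketch):

    import zlib

    def key_block_ratio(sorted_keys, block_size=5000):
        # Take roughly block_size bytes worth of adjacent sorted keys
        # and see how well they compress; neighbouring keys share long
        # prefixes, which is where zlib wins.
        block, used = [], 0
        for key in sorted_keys:
            block.append(key)
            used += len(key) + 1
            if used >= block_size:
                break
        raw = '\n'.join(block)
        return len(zlib.compress(raw)) / float(len(raw))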
> > * What about radix trees for the blocks instead? That way there
> > would be no decompression step because the tree would already
> > have high entropy. However, backpointers from the node storage
> > into the key index would be trickier to do.
>
> I think your file description might be clearer if you said there are 4
> things:
> preamble
> key->TKI block mapping
> TKI blocks
> value blocks
>
> To me this makes it more clear about where we can tweak and optimise the
> design. (the TKI blocks can remain unaltered while we change the
> key->TKI block mapping)
Sure.
I prefer to think about it like this, because there appears to be
a natural layering:
    CompactGraph:
        Preamble
        TKI mapping (a Dictionary)
        Value blocks
Then the TKI is a Dictionary of some sort. My proposal makes it
look like:
    Dictionary:
        key -> block mapping
        list of blocks
This way we can easily experiment with different dictionaries and
also use a plain dictionary for the signature index.
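In code, that layering would look roughly like this (hypothetical names, just to show the separation):

    # Hypothetical layering: CompactGraph owns the preamble and the
    # value/graph sections, and delegates every key lookup to a
    # pluggable Dictionary implementation (the TKI).
    class CompactGraph(object):

        def __init__(self, preamble, tki, graph_sections):
            self._preamble = preamble
            self._tki = tki                 # any Dictionary implementation
            self._graphs = graph_sections   # graph number -> section reader

        def get_node(self, key, graph_number=1):
            # The Dictionary returns one node offset per graph.
            node_offsets = self._tki.lookup(key)
            return self._graphs[graph_number].read(
                node_offsets[graph_number - 1])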
> > * What is the best sorting? Can we optimize the case where we
> > want checkouts to be fast, (which is a linear traversal) but we
> > don't care about revisions after unless we are doing log; in
> > the checkout case we'd want to store some other recent
> > information there (Thanks to Aaron for pointing this out)
>
> These cases are in different graphs; we probably cannot optimise both in
> a single index but there are options - we can do multiple indices (high
> cost), we could do multiple value blocks (less duplication, but only one
> graph accessible at a time; kindof like two-indices-in-one), we could
> optimise only one graph for locality.
As described above, I think multiple node offsets per key in the
TKI, with separate graph sections, is the way to go, especially
given the duplication in the keyspace present in iix and rix.
> > * Perhaps use length + compressed XML hunk for the preamble
> > instead of the struct-like representation above.
>
> Please no. We haven't added xml anywhere for quite some time.
Sure.
> > That's all for now; I'm going to proceed and implement the above.
> > Comments are appreciated, nothing is set in stone.
> >
> > Thanks to Aaron and Robert for their patience with me while
> > explaining things on IRC.
>
> Happy to discuss, and I'm really glad to see you've put serious thought
> into this design.
>
> A few more notes - there's no need to give this its own signature and
> classnames - just replace the toy index I wrote. The toy is documented
> as being unstable and to-be-replaced. Its disk format has already
> changed several times since it was merged to mainline.
I'll re-implement the current interface using
CompactGraph and Dictionary.
> I think a useful thing to do before all the code is written is to do
> some modelling: how big will an index be given sample data (e.g.
> bzr.dev's knits). How many separate read operations are needed to
> perform e.g. checkout, or commit, with this index. And so on.
> Instrumenting the current index might be a good way to find out the
> access patterns that are made. I suggest this because it takes time to
> write and tune code; some modelling can help validate the design before
> getting to the code.
I think the design outlined above for graph storage is pretty
reasonable, and by separating the TKI from the rest it should be
straightforward to experiment with different designs. I've taken
my 'vacation' to work on this, and I'd rather write code than
instrumentation (I know, I should eat my vegetables); I'd rather
experiment with different implementations than instrument the
existing code. If someone wants to work on instrumentation in
parallel with me, that'd be great.
Keir
P.S. I'm not sold on the perfect hashes yet; they seem like they
will be slow without a C implementation, at least for the creation
stage.