[RFC] New generic graph storage format

Robert Collins robertc at robertcollins.net
Fri Sep 7 23:10:44 BST 2007


On Fri, 2007-09-07 at 14:58 -0700, Keir Mierle wrote:
> Get a coffee; you'll be done with it before this message.

:)



> On 9/7/07, Robert Collins <robertc at robertcollins.net> wrote:
> > On Thu, 2007-09-06 at 21:16 -0700, Keir Mierle wrote:
> > > I've been working on replacing bzrlib.index.GraphIndexBuilder and
> > > bzrlib.index.GraphIndex with something offering operations which
> > > are all better than linear for the cases we care about.
> >
> > Cool. Thank you for taking up this challenge. I may come across as
> > kind of picky below, but it's simply because this part of the
> > infrastructure gets a lot of load put on it with wildly varying
> > constraints - latency, bandwidth, CPU, memory, branch vs log vs merge vs
> > tree construction.
> 
> Absolutely; I posted my not-entirely-baked design looking for this sort of
> discussion.
> 
> > I've put a copy of one of the converted mozilla branch .tix indices at
> > http://people.ubuntu.com/~robertc/0.tix. It's 95M or so in size :).
> >
> > It should provide solid sample data, though it's not in the most recent
> > version of the current graph index code - it's missing the index record
> > count header - but that should be easy to tweak. I'm going to reconvert
> > the moz knit repository to the current format, and when that's done I'll
> > submit another copy for you.
> >
> > This index is pathological - there are no merges in it at all. Typically,
> > though, one in three (IIRC) commits is a merge in modern histories, but it
> > is still useful for considering scaling properties experimentally, e.g.
> > 'how long to find the knit hunks to read to reconstruct text XXX over a
> > 300ms SFTP link with 512K per second of bandwidth', or 'how long to find
> > the knit hunks to read to reconstruct <set of texts for the entire
> > tree>'.
> 
> I'll do some experiments with it.
> 
> > > First, an introduction. The GraphIndex, as it stands, is used for
> > > several parts of bzrlib. However, it does not expose an interface
> > > conforming exactly to the idea of an abstract graph, because the
> > > GraphIndex 'knows' that a key may be a tuple of subkeys, and it
> > > also allows multiple reference-lists to be stored.
> >
> > I think there are two things conflated here. The key being a tuple of
> > strings is not the same as the key being a tuple of subkeys. If you
> > think of a composite index in e.g. postgresql, a key in that index is
> > the entire tuple.
> 
> I am not planning on changing the current GraphIndex interface; I
> am planning on building it with my CompactGraph class.
> 
> Unless we store the raw string keys (rather than the tuples) in a
> separate place (allowing us to store the tuples as (keyID1,
> keyID2) where the mapping key->keyID is stored separately), I see
> no reason to support tuples of keys in the layer which does
> serialization. It's still just k->v, k->k*. We can build tuples on
> top of that.
> 
> I think the design will be simpler and clearer this way.
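> 
> A rough sketch of the split I have in mind (names are illustrative
> only, nothing here is final):
> 
>   # Hypothetical sketch: intern raw key strings to small integer IDs,
>   # and build tuple keys as tuples of IDs on top of the flat
>   # k->v, k->k* serialization layer.
>   class KeyInterner(object):
> 
>       def __init__(self):
>           self._ids = {}     # raw key string -> integer ID
>           self._keys = []    # integer ID -> raw key string
> 
>       def intern(self, key):
>           if key not in self._ids:
>               self._ids[key] = len(self._keys)
>               self._keys.append(key)
>           return self._ids[key]
> 
>   # A tuple key like ('file-id', 'revision-id') would then be stored
>   # as (intern('file-id'), intern('revision-id')).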
> 
> > Allowing multiple reference lists was a decision taken because of the
> > relative size of keys and references - e.g. for the 0.tix I've uploaded.
> >
> > 717921 rows
> > size of a key reference - 9 bytes
> > 101534630 total bytes
> > 2 key references per node (it's pathological).
> > so ballparking - roughly 12M of the 96M is in node references; the rest
> > of each node is encoding overhead or the key at that position.
> >
> > Now, the current toy index is not optimally encoded, but a naive split
> > into multiple indices with one graph each will save 6M per index, but
> > duplicate 84M.
> 
> Assuming it's not compressed, which I am convinced (though not yet
> benchmark-convinced) we should do. The keys are highly redundant.
> More on this later.
> 
> > > I consider this to be a layering violation, not because of picky
> > > software engineering reasons, but because having multiple
> > > adjacency lists inside the single storage space prevents having
> > > the data spatially coherent. Or rather, only one of the adjacency
> > > lists can be optimized in this way. If you want to store multiple
> > > adjacency lists anyway, put the alternate references in the value
> > > for that node or build a separate GraphIndex.
> >
> > What I read here is 'because only one of the graphs in the index can be
> > spatially located, we will only support one graph'. The cause and effect
> > here do not line up IMO - an inability to optimise both graphs simply
> > implies that only one graph can be optimised.
> 
> I buy that.
> 
> > Putting the graph into the value will inflate the index size
> > substantially.
> > So assuming you are right to force this down to one graph
> > per index, we will have to solve the spatial inflation costs somehow
> > IMO. Commits will write to both the text reconstruction index and the
> > file-graph index so more data to write is more overhead.
> 
> I've thought of two ways to solve this. The first and most obvious
> is what you're already doing, storing the second list of
> references (as byte-offset-in-file).
> 
> However, I have a different suggestion, which may or may not be
> faster. I suggest storing multiple node offsets for a single key
> in the TKI; then having multiple nodes sections corresponding to
> the different offsets. In other words, the TKI will store multiple
> node offsets for a single key (each one corresponding to a
> different graph). For example, the key 'foo' may have node offsets
> 23 and 43, where 23 corresponds to the position of that node's
> data (val+refs) in the first graph. In the second graph, the
> (vals+refs) are at 43. Then the file would take the structure
> 
> preamble
> TKI (containing key -> (offset_in_graph1, offset_in_graph2))
> graph1
> graph2
> 
> Whether this is a win or not will depend somewhat on the access
> patterns and the data being stored. In the preamble, I can also
> imagine saying 'graph 2 has only references and no data' or 'graph
> 4 has only backrefs into the TKI' to avoid storing the extra byte
> for empty values.
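> 
> In pseudo-Python, a lookup against that layout might look something
> like this (the names are illustrative, not a final API):
> 
>   # Illustrative only: each TKI entry carries one node offset per
>   # graph section stored in the file.
>   node_offsets = {'foo': (23, 43)}  # key -> (offset in graph1, offset in graph2)
> 
>   def node_data(key, graph_number, graph_sections):
>       """Return (value, references) for key in the requested graph."""
>       offset = node_offsets[key][graph_number]
>       # read_node_at() is a placeholder for however the graph section
>       # decodes the value and reference offsets at a given position.
>       return graph_sections[graph_number].read_node_at(offset)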
> 
> Clarification:
> Before, I was calling the 'hash' of the key the offset of
> that node's data in the node section. I apologize if that was
> confusing. I think of a 'hash' as any mapping which shrinks the
> key to some fixed size, even though that is not the strict
> definition; so I consider the mapping key -> position of key's data
> in the node section to be a hash. Let's call this the 'node
> offset' from now on. To summarize, the TKI (topological key index)
> stores a mapping key -> node offset.
> 
> Back to our main programming.
> 
> I did some experiments to see how much of the keyspace is shared
> among the .iix, .rix, .six, and .tix files. I ignored the .tix
> file, then calculated the size of the union of the keys and how
> much smaller the result would be if only the union was stored.
> 
> I also calculated the overhead which would be incurred, assuming
> a fixed-width row layout for the blocks in the TKI (because blanks
> would have to be stored).
> 
> Below are my results for the files found at:
> http://people.ubuntu.com/~robertc/baz2.0/.bzr/repository/indices/
> 
> Size of the union of the keys of iix, rix, six for some file:
>        len union 3
> (Neat - there's a roughly 60% saving by storing only the union of the keys.)
>        len(union) / sum of len(keys in [irs]x) 0.375
> (These numbers say what fraction of the column associated with
> the particular graph will be blank.)
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.333333333333
> 
>        len union 70
>        len(union) / sum of len(keys in [irs]x) 0.454545454545
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.8
> 
>        len union 207
>        len(union) / sum of len(keys in [irs]x) 0.456953642384
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.811594202899
> 
>        len union 8
>        len(union) / sum of len(keys in [irs]x) 0.444444444444
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.75
> 
>        len union 79
>        len(union) / sum of len(keys in [irs]x) 0.454022988506
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.79746835443
> 
>        len union 13391
>        len(union) / sum of len(keys in [irs]x) 0.421830209482
>        len(union-iix)/len(union) 0.000149354043761
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.629228586364
> 
>        len union 42
>        len(union) / sum of len(keys in [irs]x) 0.428571428571
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.666666666667
> 
>        len union 4
>        len(union) / sum of len(keys in [irs]x) 0.4
>        len(union-iix)/len(union) 0.0
>        len(union-rix)/len(union) 0.0
>        len(union-six)/len(union) 0.5
> 
> In summary: it appears there is a clear space gain by merging iix
> and rix. With .six it is not so clear; also, .six is merely a
> dictionary and not really a graph at all.
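> 
> Spelled out, the ratios above are computed as follows (illustrative
> code only, given three sets of keys):
> 
>   def union_stats(iix_keys, rix_keys, six_keys):
>       """Compute the ratios reported above from three sets of keys."""
>       union = iix_keys | rix_keys | six_keys
>       total = len(iix_keys) + len(rix_keys) + len(six_keys)
>       return {
>           'len union': len(union),
>           'union / sum of keys': float(len(union)) / total,
>           'blank fraction iix': float(len(union - iix_keys)) / len(union),
>           'blank fraction rix': float(len(union - rix_keys)) / len(union),
>           'blank fraction six': float(len(union - six_keys)) / len(union),
>       }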
> 
> What I propose is to make a separate class for the TKI, called
> simply Dictionary. Then the TKI is built using a
> DictionaryBuilder.  DictionaryBuilder will be a dictionary
> serializer optimized for our use case (sparse index, compressed
> keys in blocks, etc).  Furthermore, it will make swapping out the
> TKI from the GraphIndex simple.
> 
> Then the .six file ceases to be a GraphIndex (which it isn't) and
> becomes a Dictionary. Or maybe Index. I'm not sure what to call
> it.
> 
> Then instead of 4 files, we have 3 files
> (iix rix) -> ix containing 3 graphs
> six -> sig containing a Dictionary
> tix -> tix containing 2 graphs
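> 
> As a sketch, the interface split might look like this (names and
> signatures are placeholders, not a final API):
> 
>   class DictionaryBuilder(object):
>       """Serialize a key -> value mapping into compressed, sorted blocks."""
> 
>       def add(self, key, value):
>           """Record key -> value for inclusion in the output."""
> 
>       def finish(self):
>           """Return the serialized bytes: sparse index plus key blocks."""
> 
>   class Dictionary(object):
>       """Read-only access to a serialized DictionaryBuilder output."""
> 
>       def lookup(self, key):
>           """Return the value stored for key (e.g. its node offsets)."""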
> 
> > Now, the knit record hunks are currently only needed for the text
> > reconstruction index, but I can see log -v using deltas directly in the
> > future so we can't assume that the byte-ranges in the .packs will be
> > redundant.
> >
> > > You may want to skip the Format section and read my comments below
> > > first.
> > >
> > > class CompactGraphIndexBuilder(object):
> > >     """Construct a space-and-time efficient DAG serialization.
> > >
> > >     Given a DAG, this class builds a serialization suitable for rapid querying
> > >     and traversal over slow network connections, where access is via readv().
> > >
> > >     The graph consists of a series of nodes. Each node contains keys (arbitrary
> > >     length byte sequences), values (arbitrary length byte sequences), and
> > >     references to other nodes.  In other words, the graph contains a mapping
> > >     k->v, and a mapping k->k*.
> >
> > I would *really* prefer keys to be tuples of byte-sequences, as per
> > the current graph interface. I started with simple byte-sequences for
> > the graph index layer and it was too restrictive, with nearly every
> > reuse having to perform its own encoding/decoding - pushing the
> > composite key facility into the index made upper code much cleaner.
> 
> See above.
> 
> > > The on-disk format has a number of properties which are desirable for the
> > >     use cases relevant to bzr; specifically, it is compact (keys are compressed
> > >     in chunks), keys are indexed (there is an index-for-the-index so finding
> > >     the hash for a key doesn't require reading the whole index), and it is
> > >     spatially coherent (when traversing a subset of the graph, all successors
> > >     are likely nearby).
> > >
> > >     The on-disk structure breaks down into three components.
> > >
> > >       (1) The Preamble, which contains metadata about integer sizes and file
> > >           format version;
> > >       (2) The TopologicalKeyIndex, which is a mapping of key -> node offset
> > >           (aka hash) where the offsets correspond to the position of that key's
> > >           references and data in the Nodes section
> >
> > When you say 'aka hash' here, what do you mean? Do you mean that you take
> > an md5/sha1/crc or something, or are you using 'hash' to refer to the
> > address of the node in the file? If the latter, it's terribly confusing.
> > Please don't say 'hash' when you mean 'page:byte-offset' or something
> > like that.
> 
> As above, I am using 'hash' to mean the offset of that key's data
> (aka value and references) in the nodes section. I'll use 'node
> offset' from now on.
> 
> > > (3) The Nodes, which contain references to other nodes, some
> > >           value for this node, and a backpointer to the full key name in the
> > >           TopologicalKeyIndex. The nodes are ordered by a topological sort
> > >           (specifically reverse post-order), and identified according to their
> > >           offset from the start (their node hash). In other words, the node
> > >           hash is monotonically related to the position of that node in the
> > >           topological sort.
> >
> > How are references stored? Are they in the same format as the
> > backpointer, or are they pointers to the actual nodes?
> 
> The references are stored as node offsets, such that you do not
> need to go back to the TKI to traverse if you don't care about the
> actual keys, just their values and references.
> 
> > Seems to me this layout has a couple of potential tradeoffs.
> 
> Yes, so many tradeoffs!
> 
> > Referring to nodes by pointers back to the TKI will allow iterating a
> > single node to be:
> > TKI lookup
> > read node
> > TKI lookup for all referenced nodes
> > done
> >
> > Iterating just the values though (which is what text reconstruction
> > needs - we don't actually need the node names of the elements in the
> > path, just the values which contain the things in the pack we need to
> > read):
> > TKI lookup on first node
> > read node
> > start loop
> > TKI lookup for references
> > read nodes
> > loop
> 
> Slow for checkout, merge, etc.
> 
> > Whereas referring to nodes by pointers to the nodes themselves changes
> > this:
> > accessing a single node:
> > TKI lookup
> > read node
> > read referenced nodes
> > TKI access for the name of those nodes
> > done.
> >
> > but iterating just values is now:
> > TKI lookup on first node
> > read node
> > start loop
> > read referenced nodes
> > loop
> 
> This (that if you don't care about the actual keys, you can avoid
> ever knowing them or looking them up) is exactly what I intended
> to describe, though I was not clear.
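> 
> Concretely, the values-only walk would look roughly like this
> (read_node_at() is a placeholder for decoding a node at a given
> offset; nothing here is final):
> 
>   def iter_values(nodes_section, start_offset):
>       """Yield values reachable from start_offset, never touching the TKI."""
>       pending = [start_offset]
>       seen = set()
>       while pending:
>           offset = pending.pop()
>           if offset in seen:
>               continue
>           seen.add(offset)
>           value, reference_offsets = nodes_section.read_node_at(offset)
>           yield value
>           pending.extend(reference_offsets)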
> 
> > > What does all this mean? Well, the relevant part as to why this
> > > is an improvement over the current GraphIndex is that the keys
> > > are stored compressed and are indexed (an index-for-the-index).
> >
> > GraphIndex today compresses the node references using dictionary
> > compression. The achieved savings are moderate - the 9 bytes above for
> > each reference are referring to keys that are ~80 bytes long - so a 9:1
> > compression ratio - though there is lots of duplicate text as the sub
> > elements of the tuple keys are not dictionary compressed at all.
> 
> First, let's be clear about what dictionary compression is.  I
> think of LZW as dictionary compression; however, I think you mean
> storing the position of something in a dictionary rather than
> storing the thing itself.
> 
> Given that, I am doing the same, except storing it as a binary
> unsigned integer of either 1, 2, or 4 bytes depending on how big
> the nodes section is.
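> 
> For example (sketch only; the exact formats are not decided):
> 
>   import struct
> 
>   def offset_struct_format(nodes_section_length):
>       """Pick the narrowest unsigned width that can address the section."""
>       if nodes_section_length < 1 << 8:
>           return '>B'    # 1 byte
>       elif nodes_section_length < 1 << 16:
>           return '>H'    # 2 bytes
>       else:
>           return '>I'    # 4 bytes
> 
>   def encode_offset(offset, fmt):
>       return struct.pack(fmt, offset)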
> 
> > The index-for-the index would seem to be roughly what would happen if
> > the existing GraphIndex, rather than starting each node with an encoded
> > tuple started with a bytereference to the location of an encoded
> > standalone tuple. I note this similarity because I considered this for
> > the current GraphIndex, but discarded it because of the relative size of
> > the keys and values: keys are >>> values, and the API I was putting
> > together then, which needed to match the existing knit code completely
> > (I was working on thunking across in the first instance), needed to be
> > able to always return the keys, so having the keys separate didn't
> > help... If we can eliminate needing to return keys for some queries then
> > we can look at packing more values into a small area and get locality of
> > reference wins there, which I think your layout will do.
> 
> Yes, there will be performance hits until we can eliminate
> unnecessary key lookups.
> 
> > > The TopologicalKeyIndex is a series of compressed blocks,
> > > corresponding to blocks from the list of (key, hash) pairs sorted
> > > lexicographically by key. Then there is a header in the
> > > TopologicalKeyIndex, which is also compressed, which lists the
> > > first key in each block, along with the byte offset of that block
> > > in the file.
> >
> > I'm concerned about the use of compression. It may be that its needed,
> > but I'd like to see some benchmarks and tests to let us make educated
> > decisions. Compression trades off CPU for bandwidth, but as the index is
> > hopefully structured to have no redundant data one would hope there is
> > little room for compression to win.
> 
> True; however, decompression has long since been faster than disk
> or network. On my laptop sequential disk reads top out at ~30MB/s,
> and on my desktop they top out at 65MB/s. With Python's zlib, my
> laptop can decompress at 75MB/s and my desktop can decompress at
> 150MB/s. By 'decompress' I mean generating decompressed data from
> compressed data at a rate of X MB/s.
> 
> Assuming 5:1 compression, on a 100MB file, the compressed version
> is 20MB. Reading 20MB takes ~0.7 seconds on my laptop, and 0.35
> seconds on my desktop. Decompression takes ~1.3s on my laptop and
> 0.7s on my desktop. Totals for the compressed read: 2s laptop, 1s
> desktop.
> 
> Reading the uncompressed file sequentially takes 3s on my laptop,
> and 1.5s on my desktop. The compressed read beats the straight
> sequential read by 50%.
> 
> Things are a bit murkier when parts of the file may already be
> hanging around in the system's caches. However, for network
> operations, compression is a clear win. Whether compressing flat
> list blocks is better than alternative packing strategies (tries,
> B-trees) is still up for debate.
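> 
> (For anyone who wants to reproduce the throughput numbers, this is
> roughly how to measure the decompression rate - feed it a few tens
> of MB of representative data:)
> 
>   import time, zlib
> 
>   def decompress_rate(raw_bytes):
>       """Return the zlib decompression rate in MB/s on this machine."""
>       compressed = zlib.compress(raw_bytes)
>       start = time.time()
>       decompressed = zlib.decompress(compressed)
>       elapsed = time.time() - start
>       return len(decompressed) / (elapsed * 1024.0 * 1024.0)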
> 
> > > Since the header is small we download it in one go.
> >
> > This seems entirely unjustified. Say there are 700,000 80-byte keys.
> > 56,000,000 bytes.  If each 'block' is 4000 bytes in size, we have 14000
> > blocks of TKI, and that means (80 + a byte offset) * 14000 bytes, or
> > roughly a megabyte of data.
> >
> > Now I admit that's a lot less than 96M, but still, downloading 1M to
> > perform a trivial query is not necessarily good, and probably bad.
> 
> The plan was to have several knobs to keep the index small. But I
> agree, a staged TKI index is best.
> 
> > I would paraphrase your design here as having a sparse index for the
> > TKI. An alternative approach might be to use an LPC trie for the index
> > of TKI blocks. This would not identify specific keys - it would only
> > have enough selectivity to identify the block a TKI entry would be found
> > in. I suspect this trie would be about the same size as the LZW
> > compressed index.
> 
> > Working with it would be much the same, except that
> > rather than a linear scan through the list of blocks,
> 
> Hold on! I never said anything about a linear scan. It would be a
> standard binary search; remember, I am *forcing* the records to be
> fixed-width per block and in the TKI index, so we never have to
> look at all the data.
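> 
> Within one decompressed block, that search is just this (sketch; the
> fixed row width and key width come from the preamble, and NUL
> padding of short keys is an assumption):
> 
>   def find_in_block(block_bytes, key, row_size, key_size):
>       """Binary search fixed-width (key, node offset) rows for key."""
>       lo, hi = 0, len(block_bytes) // row_size
>       while lo < hi:
>           mid = (lo + hi) // 2
>           row = block_bytes[mid * row_size:(mid + 1) * row_size]
>           row_key = row[:key_size].rstrip('\0')
>           if row_key == key:
>               return row[key_size:]   # the encoded node offset(s)
>           elif row_key < key:
>               lo = mid + 1
>           else:
>               hi = mid
>       return None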
> 
> > it would be a trie
> > traversal, and we can expect the first 4K to give at least selectivity
> > to 1/50th of the keyspace (4000/80)
> 
> Compressed 5:1, this gets you to 4000/16 -> 1/250th. I was
> thinking of letting the index part be as big as 32k. At 32k we
> have 1/2000th selectivity. That's 350 keys per selected block.
> At 16 bytes/key, that's 5600 bytes per block, which seems
> entirely reasonable.
> 
> Then to retrieve any key, it's a sequential 32k read followed by a
> 4k read. Also, we now have the whole TKI sparse index in memory,
> so further reads will be fast, requiring only a single 4k read.
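> 
> In outline, the access pattern is (hypothetical helpers; ROW_SIZE
> and KEY_SIZE come from the preamble, and find_in_block() is the
> per-block binary search sketched above):
> 
>   def lookup(transport, key):
>       sparse = read_sparse_index(transport, 32 * 1024)    # one 32k read
>       block_offset = sparse.block_for(key)                # in-memory search
>       block = read_block(transport, block_offset, 4096)   # one 4k read
>       return find_in_block(block, key, ROW_SIZE, KEY_SIZE)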
> 
> I'm not saying we have to do it this way, just that it's not as
> crazy as it sounds.
> 
> > will be jumping to another 4K which gives as much incremental
> > improvement again. 700000 keys can be selected from in 4 4K trie page
> > reads (and for remote indices we'd want to read more than 4K at a time,
> > or even structure it with a greater page size though that will impact
> > local disk). (By page here I mean a unit of data that we process, not
> > necessarily implying that there will be wasted space or a requirement
> > that we parse everything in a page).
> >
> > Another alternative approach would be to use a sparse B+-like tree,
> > which also has the property of pointing to blocks rather than individual
> > records, and can be serialised breadth-first, giving forward-read
> > processing where the more we read up front, the more selectivity we get,
> > without requiring reading of the entire thing.
> 
> My approach stores only pointers to blocks in the sparse index. B+
> trees could work too.
> 
> Separating the TKI into a new Dictionary class will make it easy
> to experiment with new approaches, if we can all agree on the
> two-part GraphIndex structure of TKI and graph data :)
> 
> > > Then, when finding the hash for a key in the TopologicalKeyIndex,
> > > we have to read only one other block.
> > >
> > > Within a single block, the sorted key/hash pairs are stored as
> > > fixed-size rows to make searching for a key which we know is
> > > present fast, and not require an additional indexing step. This is
> > > sensible when there are many keys with very similar lengths, which
> > > is the case for bzr. For the Python implementation, I will likely
> > > put the keys/hashes into a dictionary, but for an efficient C
> > > version we can directly binary search the bytes because they are
> > > sorted.
> >
> > We profiled doing bisection in Python, and while I don't know how many
> > records one needs, it is faster to avoid parsing all the data we do not
> > need up front. Just a note :). Profiling, as always, is the canonical
> > answer.
> 
> Was bisection in Python reasonably fast?
> 
> > > Questions:
> > >  * Use LZW for the blocks? With 5k of sorted keys, it compresses
> > >    to about 1k using zlib.compress().
> >
> > Which keys are you testing this with? 5:1 is certainly nice to have.
> 
> The ones in your repository (mentioned in my experiments above)
> and the mozilla .tix. The keys in every file compress to within 1%
> of 20% of their original size.
> 
> > > * What about radix trees for the blocks instead? That way there
> > >    would be no decompression step because the tree would already
> > >    have high entropy. However, backpointers from the node storage
> > >    into the key index would be trickier to do.
> >
> > I think your file description might be clearer if you said there are 4
> > things:
> > preamble
> > key->TKI block mapping
> > TKI blocks
> > value blocks
> >
> > To me this makes it clearer where we can tweak and optimise the
> > design. (The TKI blocks can remain unaltered while we change the
> > key->TKI block mapping.)
> 
> Sure.
> 
> I prefer to think about it like this, because there appears to be
> a natural layering:
> 
> CompactGraph:
>        Preamble
>        TKI mapping (A Dictionary)
>        Value blocks
> 
> Then the TKI is a Dictionary of some sort. My proposal makes it
> look like:
> 
> Dictionary:
>        key -> block mapping
>        list of blocks
> 
> This way we can easily experiment with different dictionaries and
> also use a plain dictionary for the signature index.
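> 
> As a class sketch (again, names are placeholders):
> 
>   class CompactGraph(object):
>       """Preamble + a Dictionary (the TKI) + one or more node sections."""
> 
>       def __init__(self, preamble, tki, nodes_section):
>           # (simplified to a single graph section for brevity)
>           self._preamble = preamble
>           self._tki = tki                    # a Dictionary: key -> node offset(s)
>           self._nodes_section = nodes_section
> 
>       def get_node(self, key):
>           """Return (value, references) for key via one TKI lookup."""
>           offset = self._tki.lookup(key)
>           return self._nodes_section.read_node_at(offset)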
> 
> > > * What is the best sorting? Can we optimize the case where we
> > >    want checkouts to be fast (which is a linear traversal), but we
> > >    don't care about revisions after unless we are doing log; in
> > >    the checkout case we'd want to store some other recent
> > >    information there. (Thanks to Aaron for pointing this out.)
> >
> > These cases are in different graphs; we probably cannot optimise both in
> > a single index but there are options - we can do multiple indices (high
> > cost), we could do multiple value blocks (less duplication, but only one
> > graph accessible at a time; kind of like two-indices-in-one), we could
> > optimise only one graph for locality.
> 
> As described above, I think the multiple-node-offsets per key in
> the TKI with different graph sections is the way to go. Especially
> given the duplication in the keyspace present in iix and rix.
> 
> > >  * Perhaps use length + compressed XML hunk for the preamble
> > >    instead of the struct-like representation above.
> >
> > Please no. We haven't added XML anywhere for quite some time.
> 
> Sure.
> 
> > > That's all for now; I'm going to proceed and implement the above.
> > > Comments are appreciated, nothing is set in stone.
> > >
> > > Thanks to Aaron and Robert for their patience with me while
> > > explaining things on IRC.
> >
> > Happy to discuss, and I'm really glad to see you've put serious thought
> > into this design.
> >
> > A few more notes - there's no need to give this its own signature and
> > class names - just replace the toy index I wrote. The toy is documented
> > as being unstable and to-be-replaced. Its disk format has already
> > changed several times since it was merged to mainline.
> 
> I'll re-implement the current interface using
> CompactGraph and Dictionary.
> 
> > I think a useful thing to do before all the code is written is to do
> > some modelling: how big will an index be given sample data (e.g.
> > bzr.dev's knits). How many separate read operations are needed to
> > perform e.g. checkout, or commit, with this index. And so on.
> > Instrumenting the current index might be a good way to find out the
> > access patterns that are made. I suggest this because it takes time to
> > write and tune code; some modelling can help validate the design before
> > getting to the code.
> 


> Keir
> 
> P.S. I'm not sold on the perfect hashes yet. They seem like they will
> be slow without a C implementation, at least for the creation stage.
> 
> 
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.