[RFC] New generic graph storage format

Robert Collins robertc at robertcollins.net
Sat Sep 8 00:40:57 BST 2007


Sorry about the whacky full-quoted response; my mail client decided that
ctrl-V was 'send the mail'.

On Fri, 2007-09-07 at 14:58 -0700, Keir Mierle wrote:
> Get a coffee, you'll be done with it before this message.

:)

> I think the design outlined above for graph storage is pretty
> reasonable, and by separating the TKI from the rest it should be
> straightforward to experiment with different designs. I've taken
> my 'vacation' to work on this, and I'd rather write code than
> instrumentation (I know, I should eat my vegetables). I'd rather
> experiment with different implementations than instrument the
> existing code. If someone wants to work on instrumentation in
> parallel with me, that'd be great.

Absolutely you should do the things that interest you; in all my
suggestions I'm really just trying to help ensure that what you end up
creating has as much of a win to it as possible. In the bzr project
we've taken a few steps that didn't consider as much of the
problem-space as we could have - they were all wins in one way or
another, but sometimes they brought regressions too.


> > I think there are two things conflated here. The key being a tuple of
> > strings is not the same as the key being a tuple of subkeys. If you
> > think of a composite index in e.g. postgresql, a key in that index is
> > the entire tuple.
> 
> I am not planning on changing the current GraphIndex interface; I
> am planning on building it with my CompactGraph class.
> 
> Unless we store the raw string keys (rather than the tuples) in a
> separate place (allowing us to store the tuples as (keyID1,
> keyID2) where the mapping key->keyID is stored separately), I see
> no reason to support tuples of keys in the layer which does
> serialization. It's still just k->v, k->k*. We can build tuples on
> top of that.

I think this is a valid choice, and I'm reassured that the client API
will still use tuple keys. Thanks. I just asked myself why I'd push the
awareness of tuples lower, and it's because of several interacting
pressures. Key prefix searches are one (but they can be solved e.g. by
searching for the partially serialised key - one bytestring +
separator). Another is that much of the redundancy in the current
keyspace comes from the repeated fileid prefix in the .tix file, and a
separate key index shared by all the graphs might get better results if
it (for example) used pointers to each key element directly in the
nodes.
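
To make the prefix-search point concrete, here's a quick Python sketch
(the separator, function names and layout are mine, purely illustrative
- not the actual bzrlib serialisation): flatten the tuple key to one
bytestring with a separator, and a key-prefix search becomes an
ordinary bytestring prefix match over the sorted flat keys.

import bisect

SEP = b'\x00'  # illustrative separator; any byte that cannot appear in key elements

def flatten(key_tuple):
    # serialise e.g. (fileid, revisionid) to a single bytestring
    return SEP.join(key_tuple)

def prefix_search(sorted_flat_keys, prefix_tuple):
    # a key-prefix search is just a bytestring prefix match over sorted flat keys
    prefix = flatten(prefix_tuple) + SEP
    start = bisect.bisect_left(sorted_flat_keys, prefix)
    for flat in sorted_flat_keys[start:]:
        if not flat.startswith(prefix):
            break
        yield flat

keys = sorted(flatten(k) for k in
              [(b'file-1', b'rev-a'), (b'file-1', b'rev-b'), (b'file-2', b'rev-a')])
print(list(prefix_search(keys, (b'file-1',))))   # both file-1 entries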

> I think the design will be simpler and clearer this way.
> 
> > Allowing multiple reference lists was a decision taken because of the
> > relative size of keys and references - e.g. for the 0.tix I've uploaded.
> >...
> > Now, the current toy index is not optimally encoded, but a naive split
> > into multiple indices with one graph each will save 6M per index, but
> > duplicate 84M.
> 
> Assuming it's not compressed, which I am convinced (though not yet
> benchmark-convinced) we should do. The keys are highly redundant.
> More on this later.

One thing you may not be aware of is that we are considering using hash
based addressing for some objects in the repository. Martin is doing an
experiment with inventories using this (not because it's decided, but
because it was a simple way to do the main work of splitting the
inventory into pieces). This experiment adds .hix, which *at the moment*
is simply a key:value mapping, but could quite easily become graph aware
to allow deltas.
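
By 'graph aware' I mean something along these lines (a toy sketch, not
the actual .hix code; the class and method names are mine):

from hashlib import sha1

class HashIndexBuilder(object):
    """Toy sketch of a hash-addressed index."""

    def __init__(self):
        self._nodes = {}

    def add(self, content, value):
        # today: a plain key:value mapping, keyed by content hash
        key = sha1(content).hexdigest()
        self._nodes[key] = value
        return key

    def add_with_parents(self, content, value, parent_keys=()):
        # graph aware: also record references, which is what deltas need
        key = sha1(content).hexdigest()
        self._nodes[key] = (value, tuple(parent_keys))
        return key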

> > Putting the graph into the value will inflate the index size
> > substantially.
> > So assuming you are right to force this down to one graph
> > per index, we will have to solve the spatial inflation costs somehow
> > IMO. Commits will write to both the text reconstruction index and the
> > file-graph index so more data to write is more overhead.
> 
> I've thought of two ways to solve this. The first and most obvious
> is what you're already doing, storing the second list of
> references (as byte-offset-in-file).
> 
> However, I have a different suggestion, which may or may not be
> faster. I suggest storing multiple node offsets for a single key
> in the TKI; then having multiple nodes sections corresponding to
> the different offsets. In other words, the TKI will store multiple
> node offsets for a single key (each one corresponding to a
> different graph). For example, the key 'foo' may have node offsets
> 23 and 43, where 23 corresponds to the position of that node's
> data (val+refs) in the first graph. In the second graph, the
> (vals+refs) are at 43. Then the file would take the structure
> 
> preamble 
> TKI (containing key -> (offset_in_graph1, offset_in_graph2))
> graph1
> graph2
> 
> Whether this is a win or not will depend somewhat on the access
> patterns and the data being stored. In the preamble, I can also
> imagine saying 'graph 2 has only references and no data' or 'graph
> 4 has only backrefs into the TKI' to avoid storing the extra byte
> for empty values.

Interesting. So this gives locality for each graph traversal, at the cost
of duplicating the stored values, or at the cost of requiring both graphs
to be read to get the values while traversing the second graph.
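
To be sure I've understood, here's a rough sketch of a builder for that
layout (the byte encoding and names are invented just to show the shape
- preamble, TKI with one offset per graph, then the graph sections):

import struct

def build_index(keys, graphs):
    # keys:   sorted list of bytestring keys
    # graphs: list of dicts, one per graph, mapping key -> serialised value+refs
    offsets = dict((key, []) for key in keys)
    sections = []
    for graph in graphs:
        blob = b''
        for key in keys:
            offsets[key].append(len(blob))   # node offset within this graph section
            blob += graph.get(key, b'')      # empty node if this graph lacks the key
        sections.append(blob)
    # one TKI row per key: key NUL fixed-width offset per graph
    tki = b''.join(
        key + b'\x00' +
        b''.join(struct.pack('>I', off) for off in offsets[key]) + b'\n'
        for key in keys)
    preamble = ('graphs=%d tki=%d\n' % (len(graphs), len(tki))).encode('ascii')
    return preamble + tki + b''.join(sections)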

So let's put a few use cases up:
log:
  - reads the .rix history graph from a given revision back to the start
of time. In future it will probably read only to the first dominator,
and then a dominator at a time, but that's future work. For now it wants
the full transitive expansion of the graph. What data does it need back?
log does:
 - get the graph of revid:parents
 - sort that graph to group merges and assign numeric revision ids
 - iterate the texts for the revisions in the sorted list. As reading
the texts from the pack is a round trip per request, we currently use an
iterator so that we can request many texts at once; planning the readv
on the pack therefore requires many of the values from the nodes in the
graph upfront - an iterator is not sufficient - though we could do it in
batches of e.g. 1000 (a sketch follows this list).
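
The batching would look something like this (iter_texts_in_batches,
node_iter and read_records are placeholders for whatever interface we
end up with, not existing API):

from itertools import islice

def iter_texts_in_batches(node_iter, read_records, batch_size=1000):
    # node_iter yields (key, value) where value locates a text in a pack;
    # read_records plans and performs one readv for a whole batch of values.
    node_iter = iter(node_iter)   # accept any iterable
    while True:
        batch = list(islice(node_iter, batch_size))
        if not batch:
            return
        # one readv plan (one round trip) per batch rather than per text
        for text in read_records([value for key, value in batch]):
            yield text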

So log would probably be hurt overall if the values were decoupled from
the graph iteration, though it would possibly be faster to start
logging.

log FILE:
log FILE is the primary user of the .tix index's first graph, the
per-file history graph. log FILE does:
 - everything log does to get the overall log shape
 - find the inventory file id and revision id of FILE
 - a transitive graph walk of the file-modification graph, but not the
data, starting from (fileid, revision)
 - filters the overall revision graph down using the per-file graph (see
the sketch after this list)
 - shows the results.
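
The filtering step, very roughly (a deliberately naive sketch with
invented names; the real code is more involved):

def filter_revision_graph(revision_parents, file_graph_revisions):
    # revision_parents: dict revid -> tuple of parent revids (the whole graph)
    # file_graph_revisions: set of revids present in the per-file graph
    interesting = {}
    for revid, parents in revision_parents.items():
        if revid in file_graph_revisions:
            interesting[revid] = tuple(
                p for p in parents if p in file_graph_revisions)
    return interesting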

commit:
 commit checks the revision graph when dealing with merges.
 Each file committed may have several heads in its per-file graph. We do
a heads() calculation on the last-modified revisions from each of the
parent inventories for the file being committed. We used to do this on
the per-file graph, but recently we have proved a theorem that lets us
use the revision graph, which should commonly have better locality of
reference. We don't need the value for these queries.
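
For reference, the heads() calculation is just 'drop any candidate that
is an ancestor of another candidate'; a naive sketch over a parent map
(the real implementation is smarter about how far it walks):

def heads(parent_map, candidates):
    # parent_map: dict revid -> tuple of parent revids
    # returns the candidates that are not ancestors of any other candidate
    candidates = set(candidates)
    ancestors = set()
    for start in candidates:
        pending = list(parent_map.get(start, ()))
        while pending:
            rev = pending.pop()
            if rev in ancestors:
                continue
            ancestors.add(rev)
            pending.extend(parent_map.get(rev, ()))
    return candidates - ancestors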

> Back to our main programming.
> 
> I did some experiments to see how much of the keyspace is shared
> among the .iix, .rix, .six, and .tix files. I ignored the .tix
> file, then calculated the size of the union of the keys and how
> much smaller the result would be if only the union was stored.
> 
> I also calculated the overhead which would be incurred, assuming
> a fixed-width row layout for the blocks in the TKI (because blanks
> would have to be stored).

So the keys for the .tix are composite keys using the fileid and
revision id. fileids are not indexed anywhere else today. Considering
just byte strings:
.rix and .iix will have identical keyspaces - all revisions always have
exactly one inventory with the same key.
.tix will have 0->many keys whose byte strings end with \NULL + the .rix
key (because at the tuple level the key is (fileid, revisionid)).

> Below are my results for the files found at:
> http://people.ubuntu.com/~robertc/baz2.0/.bzr/repository/indices/

> In summary: it appears there is a clear space gain by merging iix
> and rix. With .six it is not so clear; also, .six is merely a
> dictionary and not really a graph at all.

Yes, .six has no graph information and a specialised key:value index is
appropriate for it at some point.

Combining the revision index (has a commit graph, values give the
location of revision texts), and the inventory index (has two graphs
today, but this is a bug - it should only have one, the compression
graph, and the values give the location of inventory deltas or
fulltexts) seems reasonable to me. We don't have a hard constraint today
that there must be a revision for an inventory - the inventory is
created before the revision - but adding one makes sense to me (though
it would be nice to be able to add the two nodes separately, and have
the builder complain at finish time).

> What I propose is to make a separate class for the TKI, called
> simply Dictionary. Then the TKI is built using a
> DictionaryBuilder.  DictionaryBuilder will be a dictionary
> serializer optimized for our use case (sparse index, compressed
> keys in blocks, etc).  Furthermore, it will make swapping out the
> TKI from the GraphIndex simple.
> 
> Then the .six file ceases to be a GraphIndex (which it isn't) and
> becomes a Dictionary. Or maybe Index. I'm not sure what to call
> it.
> 
> Then instead of 4 files, we have 3 files
> (iix rix) -> ix containing 3 graphs
> six -> sig containing a Dictionary
> tix -> tix containing 2 graphs

Well, I think you can safely ignore .six today, just inserting everything
with empty parents if your code requires one and only one graph (not
because this is ideal, but because it will let you focus on the stuff
that matters).
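
i.e. something as trivial as this adapter (illustrative only; the names
are mine):

def signature_items_as_graph_nodes(signature_items):
    # adapt plain key:value signature entries to a one-graph index by
    # giving every node a single, always-empty reference list
    for key, value in signature_items:
        yield key, value, ((),)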




> The references are stored as node offsets, such that you do not
> need to go back to the TKI to traverse if you don't care about the
> actual keys, just their values and references.
> 
> > Seems to me this layout has a couple of potential tradeoffs.
> 
> Yes, so many tradeoffs!

:)

> > Referring to nodes by pointers back to the TKI will allow iterating a
> > single node to be:
> > TKI lookup
> > read node
> > TKI lookup for all referenced nodes
> > done
> >
> > Iterating just the values though (which is what text reconstruction
> > needs - we don't actually need the node names of the elements in the
> > path, just the values which contain the things in the pack we need to
> > read):
> > TKI lookup on first node
> > read node
> > start loop
> > TKI lookup for references
> > read nodes
> > loop
> 
> Slow for checkout, merge, etc.
> 
> > Whereas referring to nodes by pointers to the nodes themselves changes
> > this:
> > accessing a single node:
> > TKI lookup
> > read node
> > read referenced nodes
> > TKI access for the name of those nodes
> > done.
> >
> > but iterating just values is now:
> > TKI lookup on first node
> > read node
> > start loop
> > read referenced nodes
> > loop
> 
> This (that if you don't care about the actual keys, you can avoid
> ever knowing them or looking them up) is exactly what I intended
> to describe, though I was not clear.

So the question is what operations are helped/hindered by this
particular tradeoff.

Now, one might say 'find the lowest common ancestor' is helped by not
needing to get the keys back - but the important thing to remember is
that the *common case* for merge which uses LCA is that we have two or
more packs, and thus two or more disk indices to query. So we will need
to be comparing the keys between indices frequently.

Text reconstruction, though, which is also part of merge, is helped. I
don't have enough information to guess which is the better tradeoff - I
think this is a suck-it-and-see situation.
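
To spell out the two shapes of the value-only walk (a sketch;
tki_lookup and read_node stand in for whatever the reader would
actually expose):

def iter_values_refs_via_tki(tki_lookup, read_node, start_key):
    # references stored as back-pointers into the TKI: a TKI lookup per hop
    pending, seen = [start_key], set()
    while pending:
        key = pending.pop()
        if key in seen:
            continue
        seen.add(key)
        value, ref_keys = read_node(tki_lookup(key))
        yield value
        pending.extend(ref_keys)

def iter_values_refs_direct(tki_lookup, read_node, start_key):
    # references stored as node offsets: only the starting key touches the TKI
    pending, seen = [tki_lookup(start_key)], set()
    while pending:
        offset = pending.pop()
        if offset in seen:
            continue
        seen.add(offset)
        value, ref_offsets = read_node(offset)
        yield value
        pending.extend(ref_offsets)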


> > GraphIndex today compresses the node references using dictionary
> > compression. The achieved savings are moderate - the 9 bytes above for
> > each reference are referring to keys that are ~80 bytes long - so a 9:1
> > compression ratio - though there is lots of duplicate text as the sub
> > elements of the tuple keys are not dictionary compressed at all.
> 
> First, let's be clear about what dictionary compression is.  I
> think of LZW as dictionary compression, however I think you mean
> storing the position of something in a dictionary rather than
> storing the something itself.
> 
> Given that, I am doing the same, except storing it as a binary
> unsigned integer of either 1, 2, or 4 bytes depending on how big
> the nodes section is.

LZW is a dictionary compressor that builds its dictionary of sequences
as it goes; anything that outputs something once and then refers back to
it is a form of dictionary compression. What I meant is what the current
index code does - a key is output once, then future occurrences of the
key are output as a reference to the original output.
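
In other words, something like this (a sketch of the idea, not the
current byte format):

def dict_compress_keys(keys_in_output_order):
    # a key is written in full the first time it appears; later occurrences
    # are written as a reference to that first position
    first_position = {}
    out = []
    for key in keys_in_output_order:
        if key in first_position:
            out.append(('ref', first_position[key]))
        else:
            first_position[key] = len(out)
            out.append(('key', key))
    return out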

> > The index-for-the-index would seem to be roughly what would happen if
> > the existing GraphIndex, rather than starting each node with an encoded
> > tuple, started with a byte reference to the location of an encoded
> > standalone tuple. I note this similarity because I considered this for
> > the current GraphIndex, but discarded it because of the relative size of
> > the keys and values: keys are >>> values, and the API I was putting
> > together then (which needed to match the existing knit code completely -
> > I was working on thunking across in the first instance) needed to be
> > able to always return the keys, so having the keys separate didn't
> > help... If we can eliminate needing to return keys for some queries then
> > we can look at packing more values into a small area and get locality of
> > reference wins there, which I think your layout will do.
> 
> Yes, there will be performance hits until we can eliminate
> unnecessary key lookups.

See above - I'm not sure how much we can do to make key lookups unneeded
outside of text reconstruction.



> Reading sequentially takes 3s on my laptop, and 1.5s on my
> desktop. Decompression is beating straight up sequential reads by
> 50%.

Oh, I'm convinced that reading compressed data is a win; I was simply
talking about the form of compression :).

> Things are a bit murkier when parts of the file may already be
> hanging around in the system's caches. However, for network
> operations, compression is a clear win. Whether compressing flat
> list blocks is better than alternate packing strategies (tries,
> btree) is still up for debate.

On Ubuntu you can drop the disk cache by echoing 1
to /proc/sys/vm/drop_caches.

> > I would paraphrase your design here as having a sparse index for the
> > TKI. An alternative approach might be to use an LPC trie for the index
> > of TKI blocks. This would not identify specific keys - it would only
> > have enough selectivity to identify the block a TKI entry would be found
> > in. I suspect this trie would be about the same size as the LZW
> > compressed index.
> 
> > Working with it would be much the same, except that
> > rather than a linear scan through the list of blocks,
> 
> Hold on! I never said anything about a linear scan. It would be by
> standard binary search; remember, I am *forcing* the records to be
> fixed-width per-block and in the TKI index, so we never have to
> look at all the data.

Well, if you decrease your selectivity for the TKI index to make it
always a reasonable size, then you can indeed bisect within that and
then within the pages. OK.
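
The bisection within a fixed-width block would look roughly like this
(the padding scheme and parameter names are invented for illustration):

def block_search(block, record_width, key_width, wanted_key):
    # block is a concatenation of record_width-byte rows, sorted by key;
    # the first key_width bytes of each row hold the NUL-padded key.
    lo, hi = 0, len(block) // record_width
    while lo < hi:
        mid = (lo + hi) // 2
        row = block[mid * record_width:(mid + 1) * record_width]
        key = row[:key_width].rstrip(b'\x00')
        if key == wanted_key:
            return row            # never parsed any other row's payload
        elif key < wanted_key:
            lo = mid + 1
        else:
            hi = mid
    return None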

> > it would be a trie
> > traversal, and we can expect the first 4K to give at least selectivity
> > to 1/50th of the keyspace (4000/80)
> 
> Compressed by 1:5 this gets you to 4000/16 -> 1/250th. I was
> thinking of letting the index part be as big as 32k. At 32k we
> have 1/2000th selectivity. That's 350 keys per selection size.
> At 16 bytes/key, we have 5600 bytes per block. Which seems
> entirely reasonable.
> 
> Then to retrieve any key, it's a sequential 32k read followed by a
> 4k read. Also, we now have the whole TKI sparse index in memory,
> so further reads will be fast; requiring only a single 4k read.
> 
> I'm not saying we have to use this way, just that it's not as
> crazy as it sounds.

Ok, I'm convinced it's feasible.
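
(For anyone following along, the arithmetic works out with the round
numbers quoted above - ~16 bytes per compressed key, 700000 keys, and a
32k index read:)

compressed_key_bytes = 16
total_keys = 700000
index_read = 32 * 1000                                        # round 32k

entries_in_index = index_read // compressed_key_bytes         # 2000
keys_per_block = total_keys // entries_in_index               # 350
block_bytes = keys_per_block * compressed_key_bytes           # 5600
print(entries_in_index, keys_per_block, block_bytes)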

> > will be jumping to another 4K which gives as much incremental
> > improvement again. 700000 keys can be selected from in 4 4K trie page
> > reads (and for remote indices we'd want to read more than 4K at a time,
> > or even structure it with a greater page size though that will impact
> > local disk). (By page here I mean a unit of data that we process, not
> > necessarily implying that there will be wasted space or a requirement
> > that we parse everything in a page).
> >
> > Another alternative approach would be to use a sparse B+-like tree,
> > which also has the property of pointing to blocks not individual records
> > and can be serialised breadth first giving forward-read processing where
> > the more read up-front the more selectivity we get, without requiring
> > reading of the entire thing.
> 
> My approach stores only pointers to blocks in the sparse index. B+
> trees could work too.
> 
> Separating the TKI into a new Dictionary class will make it easy
> to experiment with new approaches, if we can all agree on the
> two-part GraphIndex structure of TKI and graph data :)

I am agreed that splitting the location-of-key functionality and the
'retrieval-of-data' functionality is sound. I'm not convinced it's
necessarily the right thing to do (because the values are so small, the
locality-of-reference wins from grouping the graph and values may be
completely overshadowed by the need to go back to the text index
entirely with high frequency). There are other bluesky options too -
e.g. ignore locality of reference for the graph, flatten the text
reconstruction tree for a given node and store it in-line in the value.
This would result in a larger index but one node lookup per involved
pack to be able to reconstruct a text.
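
To illustrate that bluesky option (the encoding and names here are made
up): rather than each node referencing only its compression parent, its
value could carry the whole reconstruction chain, so one index lookup
per involved pack suffices.

def flatten_reconstruction_chains(nodes):
    # nodes: dict key -> (pack_location, compression_parent_key_or_None)
    # returns: dict key -> list of pack_locations, fulltext first, so a
    # reader can plan every pack read for a text from a single node value
    flattened = {}

    def chain(key):
        if key not in flattened:
            location, parent = nodes[key]
            prefix = chain(parent) if parent is not None else []
            flattened[key] = prefix + [location]
        return flattened[key]

    for key in nodes:
        chain(key)
    return flattened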


> As described above, I think the multiple-node-offsets per key in
> the TKI with different graph sections is the way to go. Especially
> given the duplication in the keyspace present in iix and rix.

It's worth a shot for sure :).

I'm currently optimising initial commit performance; I want to get us
down to no more than twice the time tar takes.

> P.S. I'm not sold on the perfect hashes yet. They seem like they will
> be slow without a C implementation, at least for the creation stage.

Python is extremely fast at some things, and often they are surprising ones.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.