RFC: scaling-inventory rework current thoughts

Robert Collins robertc at robertcollins.net
Thu Aug 14 00:25:43 BST 2008


On Wed, 2008-08-13 at 17:36 -0500, John Arbash Meinel wrote:

> > So we should analyse the full stack and be clear about the impact of
> > such things. Additionally, if the storage layer *needs* to do binary
> > deltas to have efficient storage, then we undo the goal of doing less
> > work to some degree. So it needs to be considered holistically.
> 
> I agree we need to consider it holistically. I would bring up an item 10 then,
> that transmitting and storing the inventory needs to be bounded on average.
> 
> It has certainly been an issue that we've had inventory.knit dominate the
> overall size of a repository. And unfortunately that sort of info is really
> hard to tease out of pack repositories. (I think the indexes have the
> information, but we have to poke at them and add them together, etc.)
> 
> Anyway, I don't think we should require binary deltas, etc. But we should have
> a specific mention about the size of the stored inventory (and the size for
> transmission if they are separate.)

I think this is a corollary to 1 - if commit writes less than the full
tree most of the time, aggregate storage stays bounded too. We could
make 1 tighter if we want - say, total aggregate is no more than
tree_size + total_changes + log(inventory_items + changed_items). But I
think 1 is sufficient if we are not trying to game our own goals :)
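
To make that concrete, a rough sketch of the tighter bound as a
predicate - every name here is a stand-in for a measured quantity, not
anything in bzrlib:

  import math

  def within_bound(stored_bytes, tree_size, total_change_bytes,
                   inventory_items, changed_items):
      # Hypothetical check: aggregate storage for a tree should not
      # exceed tree size + total changes + a logarithmic overhead term.
      bound = (tree_size + total_change_bytes
               + math.log(inventory_items + changed_items, 2))
      return stored_bytes <= bound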

 

> >> The other is just a handle to access the text (file_id, revision_id) is
> >> our "text_key".
> > 
> > We can't assume that a (fileid, revid) key exists a-priori, we could
> > easily change to a different key system for text keys and have that work
> > - we already should always be redirecting through the inventory for the
> > revision we're examining anyhow - except when we're doing per-file graph
> > traversal. And for that we'd need to push the actual reference into the
> > tree data, if we wanted to change the text key definition. (I don't).
> 
> Right. Stuff like "log" doesn't have inventories around to redirect through.
> It can probably get them, but not terribly efficiently. (Goes back to the
> efficient mapping idea. We might extend that to include "efficiently map from
> revision_id, file_id => text".)

If an inventory supports 7, then getting an inventory from a revision id
is sufficient - and we already support that (and I don't propose that we
change that).

In fact, for log I'd be mapping up front what we want and using delta
composition as we walk history to get updated paths.
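
Sketching that (helper names are hypothetical, and I'm assuming the
(old_path, new_path, file_id, entry) delta shape we use today):

  def track_paths(interesting_ids, current_paths, history_deltas):
      # Walk history newest-to-oldest, composing inventory deltas to
      # keep an up-to-date file_id -> path map for the ids log cares
      # about.  history_deltas is a hypothetical iterator yielding one
      # inventory delta per revision walked.
      paths = dict(current_paths)
      for revision_id, delta in history_deltas:
          for old_path, new_path, file_id, entry in delta:
              if file_id in interesting_ids:
                  # Stepping backwards, old_path is the path in the
                  # older revision (None means the file was added here).
                  paths[file_id] = old_path
          yield revision_id, dict(paths)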

I think we want the requirements to be the smallest sufficient set to
make the entire system hoon - anything that can clearly be built once
the base requirements are met is not itself a requirement.


> > I'm quite certain we could update a radix tree given an inventory delta
> > to create a new radix tree (satisfies 2). And I'm quite certain we could
> > make packing of internal nodes into named byte sequences that we can
> > point to deterministic, both on initial creation and update (satisfies
> > 3). I don't know that a radix tree is the right structure, but certainly
> > I don't see any reason why we couldn't use one.
> 
> I certainly agree that given a delta we can create a new radix tree. And I
> believe we could make packing internal nodes deterministic. I'm not sure that
> we can make packing internal nodes deterministic without evaluating the whole
> tree.

I'm not proposing a radix tree at this point; it was simply an example
back in London - a counter to 'we have to do it by dirs'. After the body
of this mail I'll give a worked example for radix trees. {An exercise
for the author}.
 
> >> There is also the question of what does a delta look like, when you are
> >> using fragments. Do you get a delta-text per fragment? Are the deltas a
> >> logical whole and then fragmented  (rather than fragment and delta, you
> >> delta and fragment). Is the top level of a delta just the list of
> >> fragments that were affected? (Or doesn't it matter because the
> >> fragments would be 'chained', the intermediates will always be updated
> >> when a child is updated.)
> > 
> > As I said above deltas are the inventory deltas we have today - the in
> > memory list of changes.
> 
> then again, is that the "apply_inventory_delta" style of delta, or the
> TreeDelta style, the 'iter_changes' style, or the "knit-delta" style?

I'm working from apply_inventory_delta - the only thing we have called
an 'inventory delta' today :).
I can imagine refining it to aid updating things, but that is my
starting point.
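
For reference, the shape apply_inventory_delta consumes today, as I
understand it - a list of (old_path, new_path, file_id, new_entry)
tuples:

  # Shape sketch only: entry_for() stands in for building a real
  # InventoryEntry, and tree is any mutable tree.  old_path is None
  # for an add, new_path is None for a delete, and new_entry is None
  # on delete.
  delta = [
      (None, 'docs/index.txt', 'index-id', entry_for('docs/index.txt')),
      ('README', 'README.txt', 'readme-id', entry_for('README.txt')),
      ('old.py', None, 'old-id', None),
  ]
  tree.apply_inventory_delta(delta)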

-Rob

(*) Worked example using radix trees
------------------------------------

This is an example of how to define three methods:
canonical_form(inventory)
delta(inventory, inventory)
apply(canonical_form, delta)

Such that:
apply(canonical_form(A), delta(A, B)) == canonical_form(B)
is true for all A, B.

This does not attempt to satisfy all the constraints; it's just proving
the capability to satisfy some. (And I think there are multiple
solutions that satisfy all the constraints.)

There is a structure called an LPC Trie, which is a level- and
path-compressed trie. Path compression means that when you have a long
unbranched run of bits in a key, you store the run on a single edge
rather than emitting many internal nodes. Level compression means you
make internal nodes larger as needed so that they stay dense. (An 8-bit
internal trie is going to be very sparse for our data, and a 1-bit trie
wastes a huge number of nodes.)
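
To make path compression concrete, here is a toy path-compressed trie
in Python - level compression (variable fan-out) is elided, and none of
this is proposed bzrlib code:

  class Node(object):
      """Trie node: compressed edge label -> child, plus an optional value."""
      def __init__(self):
          self.children = {}
          self.value = None

  def _common_prefix(a, b):
      i = 0
      while i < len(a) and i < len(b) and a[i] == b[i]:
          i += 1
      return a[:i]

  def insert(node, key, value):
      """Insert key (a string) into the trie rooted at node."""
      while True:
          for label in list(node.children):
              common = _common_prefix(label, key)
              if not common:
                  continue
              child = node.children[label]
              if common == label:
                  # Edge fully matched: consume it and descend.
                  node, key = child, key[len(label):]
              else:
                  # Partial match: split the edge at the shared prefix.
                  del node.children[label]
                  mid = Node()
                  mid.children[label[len(common):]] = child
                  node.children[common] = mid
                  node, key = mid, key[len(common):]
              break
          else:
              # No edge shares a prefix with the remaining key.
              if key:
                  leaf = Node()
                  leaf.value = value
                  node.children[key] = leaf
              else:
                  node.value = value
              return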

So, canonical_form(A) will be generated by:
 - serialising every inventory entry as we do today, in xml.
 - inserting the entries into an LPC trie keyed by file id (in its utf8
   bytesequence form).
 - serialising every node of the trie from the bottom up using something
   simple; in-memory node pointers are converted into CHKs during this
   process.

Now, is this deterministic - that is, are the node contents independent
of insertion order? (Node serialisation itself can clearly be made
deterministic.) If we add a constraint to the usual LPC rules that we
must bubble compression opportunities as high up the tree as possible,
then yes. If we don't, I think minor variations are possible.
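
A sketch of the bottom-up serialisation step, reusing the toy Node
above - store is just a dict of CHK -> bytes, and the node layout here
is illustrative, not a proposed disk format. Sorting the edges makes
the byte form independent of dict order; insertion-order dependence of
the tree *shape* is the separate question above.

  import hashlib

  def serialise(node, store):
      """Post-order walk: write each node to store, return its CHK."""
      lines = ['value: %s\n' % (node.value,)]
      # Sort edges so the output is independent of dict iteration order.
      for label in sorted(node.children):
          chk = serialise(node.children[label], store)
          lines.append('child: %s %s\n' % (label, chk))
      content = ''.join(lines).encode('utf8')
      chk = 'sha1:' + hashlib.sha1(content).hexdigest()
      store[chk] = content
      return chk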

delta(A,B) is our current make_inventory_delta helper logic, which we
have scattered through bzrlib.

apply is the interesting method. To update we do the following:
 - for every file id in the delta, remove it if it was present, and
   insert it if it should be present after the application.
 - then serialise all the in-memory dirty nodes from the bottom up.
(A sketch of apply follows the remove/insert steps below.)

remove is:
 - read the trie in, a node at a time, using the CHKs to obtain child
   nodes.
 - remove the file id.
 - recompress the path back up the tree (which is quite cheap, as all
   you have to do is check whether a node has become too loosely
   packed).
 - every node on the path is flagged as dirty.
insert is basically the same:
 - serialise the entry.
 - add it to the trie, which again will read nodes on demand.
 - all the nodes on the path inside the trie are flagged dirty.
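
Putting those together, apply() might look like this - read_node,
remove_id, and insert_entry are hypothetical helpers over the toy
structures above, with read_node demand-loading child nodes via their
CHKs:

  def apply(root_chk, delta, store):
      """Apply an inventory delta to a CHK-rooted trie; return the
      new root CHK."""
      root = read_node(root_chk, store)
      for old_path, new_path, file_id, new_entry in delta:
          if old_path is not None:
              remove_id(root, file_id, store)   # dirties the path walked
          if new_path is not None:
              insert_entry(root, file_id, new_entry, store)
      # Content addressing makes re-serialisation idempotent: unchanged
      # subtrees hash to the CHKs already in the store, so only the
      # dirty paths produce new nodes.
      return serialise(root, store)

Given a deterministic serialise, the invariant above then holds by
construction: apply(canonical_form(A), delta(A, B)) == canonical_form(B).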


So, how does this scale? We'll have a mix of very small nodes and very
big nodes. E.g. PREFIX-N file ids added at the same time will cause a
PREFIX- node with N child references; but the density of those
references will be examined (and counting in decimal is quite a bit
less efficient than in binary), so the node will be split into children
at that point.

We'll probably end up with quite a few small nodes; I'll come back to
that in a second.

Writes will create at most file_id_length * len(changed_file_ids) new
nodes to write to disk (this is a property of tries in general); the
LPC trie, though, has much better characteristics. I'd need to do some
careful math to give a proper answer - but it should be obvious that
changing one file will only affect a small number of the nodes
comprising a 100K file tree.

To deal with the many-small-nodes problem, I would just change the
rules of standard LPC tries so that they have different tradeoffs - the
stock LPC trie aims for the densest nodes, but we want
fewer-larger-nodes-until-they-are-too-big.
So I'd say something like 'never split, regardless of density, until
there are 128 references', and 'always split, regardless of density,
if there are more than 2000 references'.
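
As a sketch, that policy is just (thresholds as suggested above;
density_says_split stands in for whatever the stock LPC density rule
decides):

  MIN_REFS_BEFORE_SPLIT = 128   # never split below this, however sparse
  FORCE_SPLIT_REFS = 2000       # always split above this, however dense

  def should_split(ref_count, density_says_split):
      if ref_count <= MIN_REFS_BEFORE_SPLIT:
          return False
      if ref_count > FORCE_SPLIT_REFS:
          return True
      return density_says_split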

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.