[MERGE] Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation.

Robert Collins robertc at robertcollins.net
Thu Aug 28 04:36:44 BST 2008


On Wed, 2008-08-27 at 10:27 +0200, Vincent Ladeuil wrote:

>     robert> --- doc/developers/inventory.txt	2008-01-03 01:40:16 +0000
>     robert> +++ doc/developers/inventory.txt	2008-08-27 00:42:39 +0000
> 
> <snip/>
>     robert>  In memory inventories
>     robert>  =====================
>     robert> @@ -39,9 +41,411 @@
>     robert>  All the xml serialized forms write to and read from a single byte string, whose
>     robert>  hash is then the inventory validator for the commit object.
>  
>     robert> -journalled
>     robert> -----------
>     robert> -
>     robert> -The in development journalled inventory serializer generates a single byte
>     robert> -string during serialization, but may require many byte strings to deserialize,
>     robert> -and these are discovered recursively.
>     robert> +
>     robert> +Serialization scaling and future designs
>     robert> +========================================
>     robert> +
>     robert> +Overall efficiency and scaling is constrained by the bottom level structure
>     robert> +that an inventory is stored as. We have a number of goals we want to achieve:
>     robert> +
>     robert> + 1. Allow commit to write less than the full tree's data into the repository
>     robert> +    in the general case.
> 
> I'm not sure we can respect that AND get an efficient (speed
> wise) implementation of tree differences.
>
> IME some representation of the full tree is needed when it comes
> to calculate the differences. There may be solutions to work with
> partial representation (or build/load them on demand), but I can't
> imagine them without filling the gap I mentioned earlier.

So, the full tree representation in a fragmented form is found by
reading all the fragments, following the root and out from there.
Compared to our existing plan-and-read it's more round trips, but only
as many as the height of the tree (the number of index operations also
increases, but they parallelise, as we can query them for all nodes at a
single level in parallel). Writing less than the full inventory data is
achievable by reusing nodes from an earlier version - which means having
the key for those nodes be something we can reuse. We do that today with
byte level diffs by looking up the previous key based on the compression
index. This is ok, but does not expose the fragmentation process in a
way that higher levels can use - the canonical form is still a single
long document, so that is what we have to reconstruct.
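
In rough Python, the read pattern is something like this (a sketch only:
get_fragments and child_chks are invented helper names, not real bzrlib
API; the point is one batched read per level, so round trips equal the
height of the tree):

  def read_full_tree(store, root_chk):
      # Walk the fragmented inventory a level at a time; each level
      # is fetched in a single parallel batch, so the number of
      # round trips equals the height of the tree.
      nodes = []
      pending = [root_chk]
      while pending:
          fragments = store.get_fragments(pending)  # one batched read
          nodes.extend(fragments)
          pending = [chk for frag in fragments
                         for chk in child_chks(frag)]
      return nodes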

> Said otherwise: I'm sure a huge performance gain can be achieved,
> do we really need to address the space gain too ? On the other
> hand working with less data is often a good way to achieve
> performance gains...

It's a bottleneck for commit today; do we have to fix it? No. Should we
aim to? Yes.


>     robert> + 4. Allow in memory deltas to be generated directly from the serialised form
>     robert> +    without upcasting to a full in-memory representation or examining every
>     robert> +    path in the tree. Ideally the work performed will be proportional to the
>     robert> +    amount of changes between the trees being compared.
> 
> Ok for the 'without examining every path in the tree' and 'the
> work performed will be proportional to the amount of changes
> between the trees being compared'.
> 
> Some doubts about the 'without upcasting to a full in-memory
> representation'. But I can imagine on-demand loading of the full
> representation (which also means no loading for the common parts).

Right. So if we haven't loaded common parts, we don't have a 'full'
in-memory representation.

>     robert> + 5. Allow fetch to determine the file texts that need to be pulled to ensure
>     robert> +    that the entire tree can be reconstructed without having to probe every
>     robert> +    path in the tree.
>     robert> + 6. Allow bzr map paths
> 
> Is it 'bzr to map' ?

Yes.

>     robert>      to file ids without reading the entire serialised
>     robert> +    form. This is something that is used by commands such as merge PATH and
>     robert> +    diff -r X PATH.
>     robert> + 7. Let bzr map file ids to paths without reading the entire serialised form.
>     robert> +    This is used by commands that are presenting output to the user such as
>     robert> +    loggerhead, bzr-search, log FILENAME.
> 
> Lost in the gap.

I'm not sure what you mean.


> s/.././

I think my keyboard is glitching or something; way too many
duplicate-regions in these typos.

> <snip/>
>     robert> +Canonical form
>     robert> +--------------
>     robert> +
>     robert> +There are three fragment types for the canonical form:
>     robert> +
>     robert> +root_node: (Perhaps this should be inlined into the revision object).
>     robert> +HASH_INVENTORY_SIGNATURE
>     robert> +path_map: CHK to root of path to id map
> 
> You don't explain what CHK means. Is it just CHecK as in a hash
> addressing the path_map ?

Content Hash Key, which is a generic term for using a hash (could be any
of a large number of hash functions) to generate the key for content.
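
For example (just a sketch, assuming SHA-1; any strong hash function
would do):

  import hashlib

  def chk(content):
      # The key is derived purely from the content bytes, so equal
      # content always yields equal keys, regardless of who wrote it.
      return 'sha1:' + hashlib.sha1(content).hexdigest()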

>     robert> +content_map: CHK to root of id to entry map
>     robert> +
>     robert> +map_node: INTERNAL_NODE or LEAF_NODE
> 
> You later define leaf_node not LEAF_NODE. I'm used to lowercase
> identifiers for non-terminals and uppercase identifiers for
> terminals.
> 
> Also, indenting the continuation lines in the right hand side
> (RHS) of the rules will make them easier to read.

Ok, will do in a later pass. I'm not trying for a formal grammar at
this point, more to sketch in some detail.


>     robert> +
>     robert> +(Where TYPE is I for internal or L for leaf).
>     robert> +
>     robert> +leaf_node:
>     robert> +LEAF_NODE_SIGNATURE
>     robert> +hash_prefix: PREFIX
>     robert> +HASH\x00KEY\x00 VALUE
> 
> \x00 is an implementation detail, can we get rid of it for now ?

Well, it's not really, as we're sketching out at the same level as the
btree index design stuff; if we say 'whitespace' then VALUE can't have
whitespace; and as we're talking about CHK addressing, serialisation
matters a lot. It also drives node size at worst-case fanout (e.g. the
root of the tree).
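
To illustrate (a sketch with invented helpers, not the actual
serialiser): with \x00 as the separator, a VALUE containing whitespace
still round-trips unambiguously - which matters, because the serialised
bytes are exactly what gets hashed to form the CHK.

  def serialise_leaf_line(hash_, key, value):
      # \x00 cannot occur in a hex hash or a key, so the fields can
      # always be recovered even when value contains whitespace.
      return hash_ + '\x00' + key + '\x00 ' + value

  def parse_leaf_line(line):
      hash_, key, value = line.split('\x00', 2)
      return hash_, key, value[1:]  # drop the single leading space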

> 
> <snip/>
>     robert> +Delta
>     robert> +-----
>     robert> +
>     robert> +To generate a delta between two inventories, we first generate a list of
>     robert> +altered fileids, and then recursively look up their parents to generate their
>     robert> +old and new file paths.
>     robert> +
>     robert> +To generate the list of altered file ids, we do an entry by entry comparison of
>     robert> +the full contents of every leaf node that the two inventories do not have in
>     robert> +common. To do this, we start at the root node, and follow every CHK pointer
>     robert> +that is only in one tree. We can then bring in all the values from the leaf
>     robert> +nodes and do a set difference to get the altered ones, which we would then
>     robert> +parse.
> 
> This sounds a lot like what I'm after but I don't yet get the
> needed understanding to validate it.

Ok, so - we start at the root node and deserialise it. This gives us a
list of child node CHKs, as pointers in a tree, e.g.:
'0' : HASH1
'1' : HASH2
'2' : HASH3

We then read the other root node, which might give us:
'0' : HASH1
'1' : HASH4
'2' : HASH5

We compare these nodes, and see that the pointer to 0 is equivalent, so
we eliminate that subtree.

Then on both sides we read the nodes for 1 and 2, using the CHKs HASH2,
HASH3, HASH4 and HASH5.

Rinse and repeat till we have leaf nodes, not internal nodes, everywhere.
Do a final readv of the leaf nodes, and when we deserialise those we
have lists of inventory entries, one list per node, each node being in
one of the two trees; we can set-difference the two sides to eliminate
common entries - the remainder is the difference.

We then just need to map the fileids to paths, to get an in memory
inventory delta such as apply_inventory_delta wants.
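
In rough Python, that walk looks like this (a sketch only: read_nodes is
an assumed batched fetch, and node objects with is_leaf, entries and
child_chks attributes are invented for illustration):

  def altered_entries(store, root_a, root_b):
      # Walk both trees a level at a time, discarding any CHK that
      # appears on both sides - identical keys mean identical
      # subtrees, so they cannot contain differences.
      pending_a, pending_b = set([root_a]), set([root_b])
      entries_a, entries_b = set(), set()
      while pending_a or pending_b:
          common = pending_a & pending_b
          pending_a -= common
          pending_b -= common
          next_a = set()
          for node in read_nodes(store, pending_a):  # batched readv
              if node.is_leaf:
                  entries_a.update(node.entries)
              else:
                  next_a.update(node.child_chks)
          next_b = set()
          for node in read_nodes(store, pending_b):
              if node.is_leaf:
                  entries_b.update(node.entries)
              else:
                  next_b.update(node.child_chks)
          pending_a, pending_b = next_a, next_b
      # Set difference each way: the remainders are the delta.
      return entries_a - entries_b, entries_b - entries_a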

>     robert> +
>     robert> +Radix tree based inventories
>     robert> +============================
>     robert> +
> 
> <snip/>
> 
> As summarized below, I lack the knowledge to validate the
> differences between hash tries and radix trees, nothing sounds
> wrong while reading though.

Uhm, you might like to read 'Ideal Hash Trees' by Phil Bagwell, a paper
on HAMT - Hash Array Mapped Tries - which has many common elements. The
data structure described there can be considered an ancestor of what
I'm proposing, with some key differences:
 - we need to guarantee one and only one representation for any content
   of the tree
 - we have write-once storage and arbitrary value sizes, so the disk
   layout stuff in that paper is largely irrelevant - we always have
   to write up the tree, so we have a slightly different case to
   optimise for.

The tree we build is essentially a tree of hash partitions, because
we're optimising to avoid large writes on updates. Note that the minimum
size we can look at here is the CHK size (say 44 bytes if we use hex plus
a type field, or 24 if binary plus a type field). So if every bucket we
have needs 44 bytes to point at it, 4K of data gets us about 100 buckets
- and at 200 bytes per child, 20 entries per bucket. So a 4K root
partition map could give us 2000 files with 2-layer reads - but the
problem is ensuring a canonical form. I have not got a good answer for
that using a strict HP (hash partition) approach; thus it's still
sketchy. My suspicion is that to cap the amount of data to examine on
rebalance we need hierarchical splitting, not linear splitting. And a
single HP clearly gives us the best direct access to a given bucket, but
at the cost of rewriting the HP on every commit: we would be doing IO
proportional to its size on every operation. A smaller HP is better
then, and the easiest way to achieve that is to permit a tree rather
than a single HP. Which is what I came up with :).
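
Spelling out that arithmetic (assumed sizes only):

  NODE_SIZE = 4096   # a 4K hash partition
  CHK_SIZE = 44      # hex hash plus a type field
  ENTRY_SIZE = 200   # rough serialised size of one child entry

  buckets = NODE_SIZE // CHK_SIZE        # ~93, call it ~100 buckets
  per_bucket = NODE_SIZE // ENTRY_SIZE   # ~20 entries per 4K bucket
  capacity = buckets * per_bucket        # ~2000 files in 2-layer reads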



>     robert> +   width is already 1 - the minimum).
>     robert> +   To shrink the prefix of an internal node, create new internal nodes for each
>     robert> +   new prefix, and populate them with the content of the nodes which were
>     robert> +   formerly linked. (This will normally bubble down due to keeping densely
>     robert> +   packed nodes).
>     robert> +   To shrink the prefix of a leaf node, create an internal node with the same
>     robert> +   prefix, then choose a width for the internal node such that the contents 
>     robert> +   of the leaf all fit into new leaves obeying the min_size and max_size rules.
>     robert> +   The largest prefix possible should be chosen, to obey the
>     robert> +   higher-nodes-are-denser rule. That rule also gives room in leaf nodes for 
>     robert> +   growth without affecting the parent node packing.
>     robert> +1. Update the CHK pointers - serialise every altered node to generate a CHK,
>     robert> +   and update the CHK placeholder in the node's parent; then reserialise the
>     robert> +   parent. CHK pointer propagation can be done lazily when many updates are
>     robert> +   expected.
>     robert> +
> 
> Hmmm. Mixed feelings as a reviewer here. On the one hand you
> provide a lot of details to guide the implementation, which is
> good. On the other, I am unable to validate such details without at
> least a prototype implementation (I confess I have never
> implemented such a scheme even if most of the concepts ring
> bells).

Well, I'm heading towards implementation; but this needs reasoned
validation, I feel, because stochastic testing is insufficient to show
correctness, and we'll leave users wedged if this layer is buggy.
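
For what it's worth, the leaf-split rule quoted above comes out roughly
like this in Python (a sketch under my own naming - serialised_size,
make_leaf and make_internal are invented; the real rule also has to pick
the largest workable prefix and honour min_size, which this does not
attempt):

  def split_leaf(leaf, max_size):
      # Grow the prefix width until every bucket of entries fits in
      # a leaf obeying max_size, then put an internal node with the
      # old prefix over one new leaf per bucket.
      width = len(leaf.prefix) + 1
      while True:
          buckets = {}
          for hash_, key, value in leaf.items:
              buckets.setdefault(hash_[:width], []).append(
                  (hash_, key, value))
          if all(serialised_size(items) <= max_size
                 for items in buckets.values()):
              break
          width += 1
      children = dict((prefix, make_leaf(prefix, items))
                      for prefix, items in buckets.items())
      return make_internal(leaf.prefix, children)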


> Conclusion
> ----------
> 
> At that point, I'm not able to make an informed judgement and I
> don't know if it's because I lack knowledge about our current
> implementation or because I can't predict if your proposed
> implementation will do better (obviously or is inclusive here).
> 
> I vote BB:tweak again since some points need to be made clearer
> but the whole explanation is better.

Well, I've made the typo changes - thanks. I'm not sure what to do
about the unclear bits, because you said you didn't know enough, rather
than that the text was unclear, skipped steps, or otherwise.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.