[MERGE] Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation.

Vincent Ladeuil v.ladeuil+lp at free.fr
Thu Aug 28 09:26:25 BST 2008


>>>>> "robert" == Robert Collins <robertc at robertcollins.net> writes:

    robert> On Wed, 2008-08-27 at 10:27 +0200, Vincent Ladeuil wrote:
    robert> --- doc/developers/inventory.txt	2008-01-03 01:40:16 +0000
    robert> +++ doc/developers/inventory.txt	2008-08-27 00:42:39 +0000
    >> 
    >> <snip/>
    robert> In memory inventories
    robert> =====================
    robert> @@ -39,9 +41,411 @@
    robert> All the xml serialized forms write to and read from a single byte string, whose
    robert> hash is then the inventory validator for the commit object.
    >> 
    robert> -journalled
    robert> -----------
    robert> -
    robert> -The in development journalled inventory serializer generates a single byte
    robert> -string during serialization, but may require many byte strings to deserialize,
    robert> -and these are discovered recursively.
    robert> +
    robert> +Serialization scaling and future designs
    robert> +========================================
    robert> +
    robert> +Overall efficiency and scaling is constrained by the bottom level structure
    robert> +that an inventory is stored as. We have a number of goals we want to achieve:
    robert> +
    robert> + 1. Allow commit to write less than the full tree's data into the repository
    robert> +    in the general case. 
    >> 
    >> I'm not sure we can respect that AND get an efficient
    >> (speed-wise) implementation of tree differences.
    >> 
    >> IME some representation of the full tree is needed when it
    >> comes to calculating the differences. There may be solutions
    >> that work with partial representations (or build/load them on
    >> demand), but I can't imagine them without filling the gap I
    >> mentioned earlier.

    robert> So, the full tree representation in a fragmented form is found by
    robert> reading all the fragments, following the root and out from there.
    robert> Compared to our existing plan-and-read it's more round trips, but
    robert> only as many as the height of the tree (the number of index
    robert> operations also increases, but they parallelise, as we can query
    robert> them for all nodes at a single level in parallel).

Haaaaa! Far clearer now :-)
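
To pin down the round-trip bound, here is a rough sketch of that
level-by-level walk (purely illustrative; store.read_nodes() and the
node methods are hypothetical names, not bzrlib API):

    def iter_tree_fragments(store, root_key):
        # Fetch every fragment of one tree, one batched read per level,
        # so the number of round trips equals the height of the tree.
        pending = [root_key]
        while pending:
            # All nodes of one level can be fetched in parallel.
            nodes = store.read_nodes(pending)
            pending = []
            for node in nodes:
                yield node
                if node.is_internal():
                    # Internal nodes carry CHK pointers to children.
                    pending.extend(node.child_keys())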

    robert> Writing less than the full inventory data is
    robert> achievable by reusing nodes from an earlier version -
    robert> which means having the key for those nodes be
    robert> something we can reuse.

Yup, key fact: we can choose an arbitrary canonical form as long
as it is canonical.

This may sound obvious, but in my two experiences with such data
representations, it's worth pointing out that finding the Right
canonical form has massive impacts on speed and memory
consumption. And that was without touching the rest of the code.
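
To make that concrete, a tiny illustration (nothing bzr-specific):
serialising the same logical content in two different orders yields
two different hashes, so only a canonical ordering lets an unchanged
node keep its old key and be reused:

    import hashlib

    entries = {'file-id-2': 'a/c', 'file-id-1': 'a/b'}

    # Non-canonical: the bytes depend on whatever order the dict yields.
    raw_any = ''.join('%s %s\n' % item for item in entries.items())

    # Canonical: sort first, so equal content always gives equal bytes...
    raw_canonical = ''.join('%s %s\n' % item
                            for item in sorted(entries.items()))

    # ...and therefore a stable, reusable node key.
    node_key = hashlib.sha1(raw_canonical.encode()).hexdigest()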

    robert> We do that today with byte level diffs by looking up
    robert> the previous key based on the compression index. This
    robert> is ok, but does not expose the fragmentation process in
    robert> a way that higher levels can use - the canonical form is
    robert> still a single long document, so that is what we have
    robert> to reconstruct.

Ooookkkkk, we are on the same page regarding the new design now.

    >> Put differently: I'm sure a huge performance gain can be
    >> achieved; do we really need to address the space gain too? On
    >> the other hand, working with less data is often a good way to
    >> achieve performance gains...

    robert> It's a bottleneck for commit today; do we have to fix
    robert> it? No. Should we aim to? Yes.

Ok.

    robert> + 4. Allow in memory deltas to be generated directly from the serialised form
    robert> +    without upcasting to a full in-memory representation or examining every
    robert> +    path in the tree. Ideally the work performed will be proportional to the
    robert> +    amount of changes between the trees being compared.
    >> 
    >> Ok for the 'without examining every path in the tree' and 'the
    >> work performed will be proportional to the amount of changes
    >> between the trees being compared'.
    >> 
    >> Some doubts about the 'without upcasting to a full in-memory
    >> representation'. But I can imagine on-demand loading of the full
    >> representation (which also means no loading for the common parts).

    robert> Right. So if we haven't loaded common parts, we don't have a 'full'
    robert> in-memory representation.

Worth mentioning then.


<snip/>
    robert> to file ids without reading the entire serialised
    robert> +    form. This is something that is used by commands such as merge PATH and
    robert> +    diff -r X PATH.
    robert> + 7. Let bzr map file ids to paths without reading the entire serialised form.
    robert> +    This is used by commands that present output to the user, such as
    robert> +    loggerhead, bzr-search, log FILENAME.
    >> 
    >> Lost in the gap.

    robert> I'm not sure what you mean.

Then forget it. I wanted to express that I couldn't validate that
part because I had to make too many hypotheses about its meaning.


<snip/>

    >> You don't explain what CHK means. Is it just CHecK, as in a
    >> hash addressing the path_map?

    robert> Content Hash Key. Which is a generic term for using a
    robert> hash (could be any of a large number of hash
    robert> functions) to generate the key for content.

Right.
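
So, spelled out (a hypothetical helper, just to fix the idea; any
sufficiently strong hash function would do):

    import hashlib

    def chk(serialised_node_bytes):
        # A Content Hash Key: the content alone determines the key, so
        # identical (canonical) content is stored once and can be reused.
        return 'sha1:' + hashlib.sha1(serialised_node_bytes).hexdigest()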

    robert> +content_map: CHK to root of id to entry map
    robert> +
    robert> +map_node: INTERNAL_NODE or LEAF_NODE
    >> 
    >> You later define leaf_node, not LEAF_NODE. I'm used to
    >> lowercase identifiers for non-terminals and uppercase
    >> identifiers for terminals.
    >> 
    >> Also, indenting the continuation lines in the right hand side
    >> (RHS) of the rules will make them easier to read.

    robert> Ok, will do in a later pass. I'm not trying for a formal grammar at
    robert> this point, more to sketch-in-some-detail.

No problem

    robert> +
    robert> +(Where TYPE is I for internal or L for leaf).
    robert> +
    robert> +leaf_node:
    robert> +LEAF_NODE_SIGNATURE
    robert> +hash_prefix: PREFIX
    robert> +HASH\x00KEY\x00 VALUE
    >> 
    >> \x00 is an implementation detail, can we get rid of it for now ?

    robert> Well, it's not really, as we're sketching out at the
    robert> same level as the btree index design stuff; if we say
    robert> 'whitespace' then VALUE can't have whitespace; and as
    robert> we're talking about CHK addressing, serialisation
    robert> matters a lot. It also drives node size at worst-case
    robert> fanout (e.g. the root of the tree).

Your call; my intuition (backed by previous experiences) is that
we will tune that to death anyway, but it's good to have a
starting point.
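
As a reading aid, here is one way the sketched leaf grammar could be
rendered (illustrative only - neither the signature string nor the
helper is a committed format):

    def serialise_leaf(hash_prefix, entries):
        # entries maps (hash, key) tuples to opaque value strings; \x00
        # is the separator precisely so values may contain whitespace.
        lines = ['LEAF_NODE_SIGNATURE\n',      # placeholder signature
                 'hash_prefix: %s\n' % hash_prefix]
        # Sorting gives the canonical ordering that CHK addressing needs.
        for (hash_, key), value in sorted(entries.items()):
            lines.append('%s\x00%s\x00 %s\n' % (hash_, key, value))
        return ''.join(lines)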

    >> 
    >> <snip/>
    robert> +Delta
    robert> +-----
    robert> +
    robert> +To generate a delta between two inventories, we first generate a list of
    robert> +altered fileids, and then recursively look up their parents to generate their
    robert> +old and new file paths.
    robert> +
    robert> +To generate the list of altered file ids, we do an entry by entry comparison of
    robert> +the full contents of every leaf node that the two inventories do not have in
    robert> +common. To do this, we start at the root node, and follow every CHK pointer
    robert> +that is only in one tree. We can then bring in all the values from the leaf
    robert> +nodes and do a set difference to get the altered ones, which we would then
    robert> +parse.
    >> 
    >> This sounds a lot like what I'm after, but I don't yet have
    >> the understanding needed to validate it.

    robert> Ok, so - we start at the root node, and deserialise it. This gives us a
    robert> list of child node CHKs, as pointers in a tree, e.g.
    robert> '0' : HASH1
    robert> '1' : HASH2
    robert> '2' : HASH3

    robert> We then read the other root node, which might give us:
    robert> '0' : HASH1
    robert> '1' : HASH4
    robert> '2' : HASH5

    robert> We compare these nodes, and see that the pointer to 0 is equivalent, so
    robert> we eliminate that subtree.

Perfect.

It may be worth mentioning the associated CHK properties:

- two different subtrees MUST have two different CHKs,

- the probability of a CHK collision is so low we can neglect it.
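
For scale (assuming a 160-bit hash such as SHA-1): by the birthday
bound, the chance of any collision among n nodes is roughly
n^2 / 2^161, so even 10^12 distinct nodes give about 10^24 / 2^161,
around 3e-25, far below the error rate of the hardware storing them.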

    robert> Then on both sides we read the nodes for 1 and 2,
    robert> using the CHKs HASH2, HASH3, HASH4 and HASH5 to read
    robert> them.

    robert> Rinse and repeat till we have leaf nodes not internal
    robert> nodes everywhere.  Do a final readv of the leaf
    robert> nodes, and when we deserialise those we have lists of
    robert> inventory entries, one list per node, and each node
    robert> is in one of the two trees; we can set difference the
    robert> two sides to eliminate common entries - the remainder
    robert> is the difference.

    robert> We then just need to map the fileids to paths, to get
    robert> an in memory inventory delta such as
    robert> apply_inventory_delta wants.

I think you can add that to your document; this really filled the
gap for me (I knew the principles of the algorithm itself, but this
clearly establishes the links with our needs).
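
In (hypothetical) code, the whole walk is roughly the following;
store.read_nodes() is again an assumed batch fetch, one call per level:

    def altered_entries(store, root_old, root_new):
        # Leaf entries present in exactly one of two CHK-addressed trees.
        old, new = set([root_old]), set([root_new])
        old_entries, new_entries = set(), set()
        while old or new:
            common = old & new      # equal CHKs mean equal subtrees: skip
            old, new = old - common, new - common
            old = read_level(store, old, old_entries)
            new = read_level(store, new, new_entries)
        # Set difference of the leaf contents: the remainder is the delta.
        return old_entries - new_entries, new_entries - old_entries

    def read_level(store, keys, entries):
        # Read one level of nodes; collect leaf entries, return child CHKs.
        children = set()
        for node in store.read_nodes(keys):
            if node.is_internal():
                children.update(node.child_keys())
            else:
                entries.update(node.entries())
        return children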

    robert> +
    robert> +Radix tree based inventories
    robert> +============================
    robert> +
    >> 
    >> <snip/>
    >> 
    >> As summarized below, I lack the knowledge to validate the
    >> differences between hash tries and radix trees; nothing sounds
    >> wrong on reading, though.



    robert> Uhm, you might like to read 'Ideal Hash Trees' by Phil
    robert> Bagwell, a paper on HAMT - Hash Array Mapped Tries -
    robert> which has many common elements. The described data
    robert> structure there can be considered an ancestor of what

I'll do that.

    robert> I'm proposing, with some key differences:
    robert>  - we need to guarantee one and only one representation for any
    robert>    content of the tree
    robert>  - we have write-once storage and arbitrary value sizes, so the
    robert>    disk layout stuff in that paper is largely irrelevant - we
    robert>    always have to write up the tree, so we have a slightly
    robert>    different case to optimise for.

Again, these are very valuable pieces of information worth adding
to your document, especially the latter, which I had in mind for a
long time.

<snip/>

    robert> - but the problem is ensuring a canonical form.

:-D

    robert> I have not got a good answer for that using a strict
    robert> HP approach; thus it's still sketchy. My suspicion is
    robert> that to cap the amount of data to examine on
    robert> rebalance we need hierarchical splitting, not linear
    robert> splitting. And a single HP clearly gives us the best
    robert> direct access to a given bucket, but at the cost of
    robert> rewriting the HP on every commit: so we would be doing
    robert> IO proportional to its size on every operation. A
    robert> smaller HP is better, then, and the easiest way to get
    robert> that is to permit a tree rather than a single HP.
    robert> Which is what I came up with :).

Worth adding too; this is clearly an aim of the design and should
appear as such.
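
A sketch of such a hierarchical split (again hypothetical, just to
pin the idea down): when a leaf overflows, only that one node is
examined and redistributed by the next character of the hash, so
rebalancing work is capped by node size rather than by tree size,
and the result depends only on the entries, never on insertion order:

    def split(entries, depth, max_entries=255):
        # entries maps (hash, key) tuples to values; depth is how many
        # leading hash characters the path to this node already consumed.
        if len(entries) <= max_entries:
            return ('leaf', entries)
        children = {}
        for (hash_, key), value in entries.items():
            prefix = hash_[:depth + 1]   # consume one more hash character
            children.setdefault(prefix, {})[(hash_, key)] = value
        # Recursively split any still-overfull child; sorting keeps the
        # serialised form canonical.
        return ('internal', [(p, split(c, depth + 1, max_entries))
                             for p, c in sorted(children.items())])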

<snip/>

    >> 
    >> Hmmm. Mixed feelings as a reviewer here. On the one hand you
    >> provide a lot of details to guide the implementation which is
    >> good. On the other, I am unable to validate such details without at
    >> least a prototype implementation (I confess I have never
    >> implemented such a scheme even if most of the concepts ring
    >> bells).

    robert> Well, I'm heading towards implementation; but this
    robert> needs reasoned validation I feel, because stochastic
    robert> testing is insufficient to show correctness;

I think we have sufficient existing repositories to test against:
bzr, open-office, mysql, and we can add more as we get more
knowledge about how various hash functions (used for the CHK)
behave.

    robert> and we'll leave users wedged if this layer is buggy.

Clearly unacceptable :D

    >> Conclusion
    >> ----------
    >> 
    >> At that point, I'm not able to make an informed judgement and I
    >> don't know if it's because I lack knowledge about our current
    >> implementation or because I can't predict if your proposed
    >> implementation will do better (obviously or is inclusive here).
    >> 
    >> I vote BB:tweak again since some points need to be made clearer
    >> but the whole explanation is better.

    robert> Well, I've made the typo changes - thanks. I'm not
    robert> sure what to do about the unclear bits, because you said
    robert> you didn't know enough, rather than that the text was
    robert> unclear, skipped steps, or otherwise.

Funny, I was less than happy with my review because I felt I just
gave you typos, and this turned out to be enough for you to give
me the missing bits. And now you feel sorry for not giving me
enough back.

I don't pretend I understand everything now, but things are far
clearer, so: rejoice! :) I still don't understand all the details
of how it will interact with our current implementation, but the
overall picture is far clearer *and* meets the aims I had for it.

As I mentioned above, your explanations clarify a lot and can be
added to the document with very little effort.

A couple of additional remarks now.

IIUC, the inventories will now be stored fragmented. In most cases
this will allow us to find differences without rebuilding full
in-memory representations; instead, we will read only the parts we
need, on demand.

So we are back to a mix of full and delta representations. One
question is when do we store a full representation, knowing that
our canonical form *guarantees* it is equivalent to applying all
the deltas from the empty tree of NULL_REVISION. This can be
exercised against existing repositories to find the best
speed/space tradeoff, or to find the relevant parameters (HP size,
full representation frequency, etc).


