[MERGE] Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation.
Vincent Ladeuil
v.ladeuil+lp at free.fr
Thu Aug 28 09:26:25 BST 2008
>>>>> "robert" == Robert Collins <robertc at robertcollins.net> writes:
robert> On Wed, 2008-08-27 at 10:27 +0200, Vincent Ladeuil wrote:
robert> --- doc/developers/inventory.txt 2008-01-03 01:40:16 +0000
robert> +++ doc/developers/inventory.txt 2008-08-27 00:42:39 +0000
>>
>> <snip/>
robert> In memory inventories
robert> =====================
robert> @@ -39,9 +41,411 @@
robert> All the xml serialized forms write to and read from a single byte string, whose
robert> hash is then the inventory validator for the commit object.
>>
robert> -journalled
robert> -----------
robert> -
robert> -The in development journalled inventory serializer generates a single byte
robert> -string during serialization, but may require many byte strings to deserialize,
robert> -and these are discovered recursively.
robert> +
robert> +Serialization scaling and future designs
robert> +========================================
robert> +
robert> +Overall efficiency and scaling is constrained by the bottom level structure
robert> +that an inventory is stored as. We have a number of goals we want to achieve:
robert> +
robert> + 1. Allow commit to write less than the full tree's data into the repository
robert> + in the general case.
>>
>> I'm not sure we can respect that AND get an efficient (speed-wise)
>> implementation of tree differences.
>>
>> IME some representation of the full tree is needed when it comes
>> to calculating the differences. There may be solutions to work with
>> partial representations (or build/load them on demand), but I can't
>> imagine them without filling the gap I mentioned earlier.
robert> So, the full tree representation in a fragmented form is found by
robert> reading all the fragments, following the root and out from there.
robert> Compared to our existing plan-and-read it's more round trips, but only as
robert> many as the height of the tree (the number of index operations also
robert> increases, but they parallelise as we can query them for all nodes at a
robert> single level in parallel).
Haaaaa! Far clearer now :-)
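To check my understanding, the read pattern could be sketched
roughly like this (all the names here are hypothetical, not the
actual bzr API):

    def read_fragmented_tree(store, root_key):
        # Fetch a fragmented inventory level by level; the number of
        # round trips equals the height of the tree, since every node
        # at a given level is requested in a single batch.
        nodes = {}
        pending = [root_key]
        while pending:
            fetched = store.fetch_nodes(pending)  # hypothetical batch readv
            nodes.update(fetched)
            pending = []
            for node in fetched.values():
                if node.is_internal():
                    pending.extend(node.child_keys())
        return nodes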
robert> Writing less than the full inventory data is
robert> achievable by reusing nodes from an earlier version -
robert> which means having the key for those nodes be
robert> something we can reuse.
Yup, key fact: we can choose an arbitrary canonical form as long
as it is canonical.
This may sound obvious, but in my two experiences with such data
representations, it's worth pointing out that finding the Right
canonical form has massive impacts on speed and memory
consumption. And that was without touching the rest of the code.
robert> We do that today with byte level diffs by looking up
robert> the previous key based on the compression index. This
robert> is ok, but doesn't expose the fragmentation process in
robert> a way that higher levels can use - the canonical form
robert> is still a single long document, so that is what we
robert> have to reconstruct.
Ooookkkkk, we are on the same page regarding the new design now.
>> Put another way: I'm sure a huge performance gain can be achieved,
>> but do we really need to address the space gain too? On the other
>> hand, working with less data is often a good way to achieve
>> performance gains...
robert> It's a bottleneck for commit today; do we have to fix
robert> it? No. Should we aim to? Yes.
Ok.
robert> + 4. Allow in memory deltas to be generated directly from the serialised form
robert> + without upcasting to a full in-memory representation or examining every
robert> + path in the tree. Ideally the work performed will be proportional to the
robert> + amount of changes between the trees being compared.
>>
>> Ok for the 'without examining every path in the tree' and 'the
>> work performed will be proportional to the amount of changes
>> between the trees being compared'.
>>
>> Some doubts about the 'without upcasting to a full in-memory
>> representation'. But I can imagine on-demand loading of the full
>> representation (which also means no loading for the common parts).
robert> Right. So if we haven't loaded common parts, we don't have a 'full'
robert> in-memory representation.
Worth mentioning then.
<snip/>
robert> to file ids without reading the entire serialised
robert> + form. This is something that is used by commands such as merge PATH and
robert> + diff -r X PATH.
robert> + 7. Let bzr map file ids to paths without reading the entire serialised form.
robert> + This is used by commands that are presenting output to the user such as
robert> + loggerhead, bzr-search, log FILENAME.
>>
>> Lost in the gap.
robert> I'm not sure what you mean.
Then forget it; I wanted to express that I couldn't validate that
part because I had to make too many hypotheses about its meaning.
<snip/>
>> You don't explain what CHK means. Is it just CHecK, as in a hash
>> addressing the path_map?
robert> Content Hash Key. Which is a generic term for using a
robert> hash (could be any of a large number of hash
robert> functions) to generate the key for content.
Right.
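So, just to spell it out for myself, a CHK could be as simple as
this (assuming sha1; the document doesn't commit to a particular
hash function):

    from hashlib import sha1

    def chk(node_bytes):
        # Content Hash Key: derived purely from the serialised
        # content, so two identical nodes always share a key.
        return 'sha1:' + sha1(node_bytes).hexdigest()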
robert> +content_map: CHK to root of id to entry map
robert> +
robert> +map_node: INTERNAL_NODE or LEAF_NODE
>>
>> You later define leaf_node, not LEAF_NODE. I'm used to lowercase
>> identifiers for non-terminals and uppercase identifiers for
>> terminals.
>>
>> Also, indenting the continuation lines in the right hand side
>> (RHS) of the rules will make them easier to read.
robert> Ok, will do in a later pass. I'm not trying for a formal grammar at
robert> this point, more to sketch-in-some-detail.
No problem
robert> +
robert> +(Where TYPE is I for internal or L for leaf).
robert> +
robert> +leaf_node:
robert> +LEAF_NODE_SIGNATURE
robert> +hash_prefix: PREFIX
robert> +HASH\x00KEY\x00 VALUE
>>
>> \x00 is an implementation detail, can we get rid of it for now?
robert> Well, it's not really, as we're sketching out at the
robert> same level as the btree index design stuff; if we say
robert> 'whitespace' then VALUE can't have whitespace; and as
robert> we're talking about CHK addressing, serialisation
robert> matters a lot. It also drives node size at worst-case
robert> fanout (e.g. the root of the tree).
Your call, my intuition (backed by previous experiences) is that
we will tune that to death anyway, but it's good to have a
starting point.
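For what it's worth, here is how I read the proposed leaf
serialisation (my own sketch; the signature value and the entry
layout are assumptions, not the final format):

    LEAF_NODE_SIGNATURE = 'chkleaf:\n'  # invented placeholder value

    def serialise_leaf(hash_prefix, entries):
        # entries is a list of (hash, key, value) tuples. \x00
        # separates HASH and KEY so that VALUE can safely contain
        # whitespace, as discussed above.
        lines = [LEAF_NODE_SIGNATURE, 'hash_prefix: %s\n' % hash_prefix]
        for entry_hash, key, value in entries:
            lines.append('%s\x00%s\x00 %s\n' % (entry_hash, key, value))
        return ''.join(lines)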
>>
>> <snip/>
robert> +Delta
robert> +-----
robert> +
robert> +To generate a delta between two inventories, we first generate a list of
robert> +altered fileids, and then recursively look up their parents to generate their
robert> +old and new file paths.
robert> +
robert> +To generate the list of altered file ids, we do an entry by entry comparison of
robert> +the full contents of every leaf node that the two inventories do not have in
robert> +common. To do this, we start at the root node, and follow every CHK pointer
robert> +that is only in one tree. We can then bring in all the values from the leaf
robert> +nodes and do a set difference to get the altered ones, which we would then
robert> +parse.
>>
>> This sounds a lot like what I'm after, but I don't yet have the
>> understanding needed to validate it.
robert> Ok, so - we start at the root node and deserialise it. This gives us a
robert> list of child node CHKs, as pointers in a tree, e.g.
robert> '0' : HASH1
robert> '1' : HASH2
robert> '2' : HASH3
robert> We then read the other root node, which might give us:
robert> '0' : HASH1
robert> '1' : HASH4
robert> '2' : HASH5
robert> We compare these nodes, and see that the pointer to 0 is equivalent, so
robert> we eliminate that subtree.
Perfect.
It may be worth mentioning the associated CHK property:
- two different subtrees MUST have two different CHKs,
- the probability of a CHK clash is so low we neglect it.
robert> Then on both sides we read the nodes for 1 and 2,
robert> using the CHKs HASH2, HASH3, HASH4 and HASH5 to read
robert> them.
robert> Rinse and repeat till we have leaf nodes not internal
robert> nodes everywhere. Do a final readv of the leaf
robert> nodes, and when we deserialise those we have lists of
robert> inventory entries, one list per node, and each node
robert> is in one of the two trees; we can set difference the
robert> two sides to eliminate common entries - the remainder
robert> is the difference.
robert> We then just need to map the fileids to paths, to get
robert> an in memory inventory delta such as
robert> apply_inventory_delta wants.
I think you can add that to your document; this really filled the
gap for me (I knew the principles of the algorithm itself but
this clearly establishes the links with our needs).
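For the document, the whole walk could be condensed into something
like this (again with hypothetical store/node interfaces):

    def altered_entries(store, root_a, root_b):
        # Walk both CHK trees, pruning identical subtrees: equal
        # CHKs mean equal content, so only differing pointers are
        # followed down.
        pending_a, pending_b = set([root_a]), set([root_b])
        leaves_a, leaves_b = set(), set()
        while pending_a or pending_b:
            common = pending_a & pending_b
            pending_a -= common
            pending_b -= common
            next_a, next_b = set(), set()
            for pending, nxt, leaves in ((pending_a, next_a, leaves_a),
                                         (pending_b, next_b, leaves_b)):
                # One readv per level fetches all nodes in parallel.
                for node in store.fetch_nodes(pending).values():
                    if node.is_internal():
                        nxt.update(node.child_keys())
                    else:
                        leaves.update(node.entries())
            pending_a, pending_b = next_a, next_b
        # The symmetric difference eliminates entries common to both
        # sides; what remains is the delta, still to be mapped from
        # file ids to paths.
        return leaves_a ^ leaves_b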
robert> +
robert> +Radix tree based inventories
robert> +============================
robert> +
>>
>> <snip/>
>>
>> As summarized below, I lack the knowledge to validate the
>> differences between hash tries and radix trees; nothing sounds
>> wrong while reading, though.
robert> Uhm, you might like to read 'Ideal Hash Trees' by Phil
robert> Bagwell, a paper on HAMT - Hash Array Mapped Tries,
robert> which has many common elements. The described data
robert> structure there can be considered an ancestor for
robert> what
I'll do that.
robert> I'm proposing with some key differences:
robert> - we need to guarantee one and only one representation for any content
robert> of the tree
robert> - we have write once storage and arbitrary value sizes, so the disk
robert> layout stuff in that paper is largely irrelevant - we always have
robert> to write up the tree, so we have a slightly different case to
robert> optimise for.
Again, these are very valuable pieces of information worth adding
to your document, especially the latter, which I had in mind for a
long time.
<snip/>
robert> - but the problem is ensuring a canonical form.
:-D
robert> I have not got a good answer for that using a strict
robert> HP approach; thus it's still sketchy. My suspicion is
robert> that to cap the amount of data to examine on
robert> rebalance we need hierarchical splitting, not linear
robert> splitting. And a single HP clearly gives us the best
robert> direct access to a given bucket, but at the cost of
robert> rewriting the HP on every commit: so we will be doing
robert> IO proportional to its size on every
robert> operation. A smaller HP is better then, and the easiest
robert> way to do that is to permit a tree rather than a
robert> single HP. Which - is what I came up with :).
Worth adding too; this is clearly an aim of the design and should
appear as such.
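If it helps, my reading of 'hierarchical splitting' is something
like the following (purely illustrative; the classes and the
threshold are invented):

    class LeafNode(object):
        def __init__(self, hash_prefix, entries):
            self.hash_prefix = hash_prefix
            self.entries = entries  # list of (hash, key, value)

    class InternalNode(object):
        def __init__(self, hash_prefix, children):
            self.hash_prefix = hash_prefix
            self.children = children  # hash character -> child node

    def maybe_split(leaf, max_entries=255):
        # When a leaf overflows, split it by the next hash character
        # beyond the current prefix: only this node is rewritten,
        # rather than rebalancing linearly across all buckets. The
        # result is deterministic for a given content set, which is
        # what keeps the representation canonical. (A child that
        # still overflows would be split again the same way.)
        if len(leaf.entries) <= max_entries:
            return leaf
        buckets = {}
        depth = len(leaf.hash_prefix)
        for entry_hash, key, value in leaf.entries:
            buckets.setdefault(entry_hash[depth], []).append(
                (entry_hash, key, value))
        children = dict((char, LeafNode(leaf.hash_prefix + char, entries))
                        for char, entries in buckets.items())
        return InternalNode(leaf.hash_prefix, children)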
<snip/>
>>
>> Hmmm. Mixed feelings as a reviewer here. On the one hand you
>> provide a lot of details to guide the implementation, which is
>> good. On the other, I am unable to validate such details without at
>> least a prototype implementation (I confess I have never
>> implemented such a scheme, even if most of the concepts ring
>> bells).
robert> Well, I'm heading towards implementation; but this
robert> needs reasoned validation I feel, because stochastic
robert> testing is insufficient to show correctness;
I think we have sufficient existing repositories to test against:
bzr, open-office, mysql, and we can add more as we get more
knowledge about how various hash functions (used for the CHK)
behave.
robert> and we'll leave users wedged if this layer is buggy.
Clearly unacceptable :D
>> Conclusion
>> ----------
>>
>> At that point, I'm not able to make an informed judgement and I
>> don't know if it's because I lack knowledge about our current
>> implementation or because I can't predict if your proposed
>> implementation will do better (obviously 'or' is inclusive here).
>>
>> I vote BB:tweak again since some points need to be made clearer
>> but the whole explanation is better.
robert> Well, I've made the typo changes - thanks. I'm not
robert> sure what to do about the unclear bits, because you said
robert> you didn't know enough, rather than that the text was
robert> unclear, skipped steps, or otherwise.
Funny, I was less than happy with my review because I felt I just
gave you typos, and this turned out to be enough for you to give
me the missing bits. And now you feel sorry for not giving me
enough back.
I don't pretend I understand everything now, but things are far
clearer, so: rejoice! :) I still don't understand all the
details of how it will interact with our current implementation,
but the overall picture is far clearer *and* meets the aims I had
for it.
As I mentioned above, you clarified a lot with your explanations,
which can be added to the document with really little effort.
A couple of additional remarks now.
IIUC the inventories will now be stored fragmented. In most
cases this will allow us to find differences without
rebuilding full in-memory representations; instead, we will read
the needed parts on demand.
So we are back to a mix of full and delta representations. One
question is when do we store a full representation, knowing that
our canonical form *guarantees* that it is equivalent to applying
all the deltas from the empty tree of NULL_REVISION. This can be
exercised against existing repositories to find the best
speed/space tradeoff or to find the relevant parameters (HP size,
full representation frequency, etc).
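A crude way to exercise that property against an existing
repository could look like this (hypothetical APIs throughout):

    def check_canonical(repo, revision_ids):
        # The canonical form guarantees that rebuilding the tree by
        # applying every delta in order gives exactly the same root
        # CHK as the stored fragments for that revision.
        tree = empty_chk_tree()  # hypothetical: tree of NULL_REVISION
        for revision_id in revision_ids:  # assumed topological order
            tree = tree.apply_delta(repo.get_inventory_delta(revision_id))
            assert tree.root_chk() == repo.get_stored_root_chk(revision_id)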