[MERGE] Describe a hash trie based inventory

Tue Aug 26 08:45:27 BST 2008

>>>>> "robert" == Robert Collins <robertc at robertcollins.net> writes:

    robert> This updates my docs on inventory with an analysis of what a hash trie
    robert> based inventory might look like/perform like.

Nice work. Even it not final, that's a significant step, so:

BB:tweak (see typos)

   Vincent

<snip/>

    robert> === modified file 'doc/developers/inventory.txt'
    robert> --- doc/developers/inventory.txt	2008-01-03 01:40:16 +0000
    robert> +++ doc/developers/inventory.txt	2008-08-22 04:46:46 +0000

<snip/>

    robert> @@ -39,9 +41,269 @@
    robert>  All the xml serialized forms write to and read from a single byte string, whose
    robert>  hash is then the inventory validator for the commit object.

<snip/>

    robert> +Design elements to achieve the goals in a future inventory implementation
    robert> +-------------------------------------------------------------------------

<snip/>

    robert> +
    robert> + * Item_keys_introduced_by is innately a history-using function; we can
    robert> +   reproduce the text-key finding logic by doing a tree diff between any tree
    robert> +   and and older tree - that will limit the amount of data we need to process

s/and and/and an/

<snip/>

    robert> +
    robert> +Hash bucket based inventories
    robert> +=============================
    robert> +
    robert> +Overview
    robert> +--------
    robert> +
    robert> +We store two maps - fileid:inventory_entry and path:fileid, in a stable
    robert> +hash trie, stored in densly packed fragemnts.. We pack keys into the map

s/fragemnts/fragments/

<snip/>

    robert> +Goal satisfaction
    robert> +-----------------
    robert> +
    robert> + 1. Success
    robert> + 2. Success
    robert> + 3. Success
    robert> + 4. Success, though each change will need its parents looked up as well
    robert> +    so it will be proportional to the changes + the directories above
    robert> +    the changed path.
    robert> + 5. Success - looking at the difference against all parents we can determine
    robert> +    new keys without reference to the repository content will be inserted
    robert> +    into.
    robert> + 6. This probably needs needs a path->id map, allowing a 2-step lookup.

s/needs needs/needs/

<snip/>
    robert> +Canonical form
    robert> +--------------
    robert> +

<snip/>

    robert> +The path and content maps are populated simply by serialising every inventory
    robert> +entry and inserting them into both the path map and the content map. The maps
    robert> +start with just a single leaf node with an empty prefix. When a leaf nodes size
    robert> +exceeds the fragment size it is split into N smaller leaf nodes with a longer
    robert> +prefix. The prefixes are the smallest that can be used to create smaller leaf
    robert> +nodes. Often this will be one octet (say from AB to two nodes with ABC and ABE
    robert> +respectively). When an internal node exceeds the size threshold the same
    robert> +process is applied. When removing a node from a from a leaf, the size of the

s/from a from a/from a/

    robert> +pther children pointed at by the containing internal node is checked - if the

s/pther/other/

Have you considered handling the inventory as a flat file-ids list
instead of a path tree ?

This could keep the nice properties of hash tries while
addressing the 10.000 files in one directory corner case as well
as the 'Goal satisfaction point 4' above.

        Vincent