[MERGE] Describe a hash trie based inventory

Vincent Ladeuil v.ladeuil+lp at free.fr
Tue Aug 26 08:45:27 BST 2008


>>>>> "robert" == Robert Collins <robertc at robertcollins.net> writes:

    robert> This updates my docs on inventory with an analysis of what a hash trie
    robert> based inventory might look like/perform like.

Nice work. Even it not final, that's a significant step, so:

BB:tweak (see typos)

   Vincent


<snip/>

    robert> === modified file 'doc/developers/inventory.txt'
    robert> --- doc/developers/inventory.txt	2008-01-03 01:40:16 +0000
    robert> +++ doc/developers/inventory.txt	2008-08-22 04:46:46 +0000

<snip/>

    robert> @@ -39,9 +41,269 @@
    robert>  All the xml serialized forms write to and read from a single byte string, whose
    robert>  hash is then the inventory validator for the commit object.
 

<snip/>

    robert> +Design elements to achieve the goals in a future inventory implementation
    robert> +-------------------------------------------------------------------------

<snip/>

    robert> +
    robert> + * Item_keys_introduced_by is innately a history-using function; we can
    robert> +   reproduce the text-key finding logic by doing a tree diff between any tree
    robert> +   and and older tree - that will limit the amount of data we need to process

s/and and/and an/

<snip/>

    robert> +
    robert> +Hash bucket based inventories
    robert> +=============================
    robert> +
    robert> +Overview
    robert> +--------
    robert> +
    robert> +We store two maps - fileid:inventory_entry and path:fileid, in a stable
    robert> +hash trie, stored in densly packed fragemnts.. We pack keys into the map

s/fragemnts/fragments/


<snip/>

    robert> +Goal satisfaction
    robert> +-----------------
    robert> +
    robert> + 1. Success
    robert> + 2. Success
    robert> + 3. Success
    robert> + 4. Success, though each change will need its parents looked up as well
    robert> +    so it will be proportional to the changes + the directories above
    robert> +    the changed path.
    robert> + 5. Success - looking at the difference against all parents we can determine
    robert> +    new keys without reference to the repository content will be inserted
    robert> +    into.
    robert> + 6. This probably needs needs a path->id map, allowing a 2-step lookup.

s/needs needs/needs/

<snip/>
    robert> +Canonical form
    robert> +--------------
    robert> +

<snip/>

    robert> +The path and content maps are populated simply by serialising every inventory
    robert> +entry and inserting them into both the path map and the content map. The maps
    robert> +start with just a single leaf node with an empty prefix. When a leaf nodes size
    robert> +exceeds the fragment size it is split into N smaller leaf nodes with a longer
    robert> +prefix. The prefixes are the smallest that can be used to create smaller leaf
    robert> +nodes. Often this will be one octet (say from AB to two nodes with ABC and ABE
    robert> +respectively). When an internal node exceeds the size threshold the same
    robert> +process is applied. When removing a node from a from a leaf, the size of the

s/from a from a/from a/

    robert> +pther children pointed at by the containing internal node is checked - if the

s/pther/other/

Have you considered handling the inventory as a flat file-ids list
instead of a path tree ?

This could keep the nice properties of hash tries while
addressing the 10.000 files in one directory corner case as well
as the 'Goal satisfaction point 4' above.

        Vincent



More information about the bazaar mailing list