[MERGE] Describe a hash trie based inventory
Vincent Ladeuil
v.ladeuil+lp at free.fr
Tue Aug 26 08:45:27 BST 2008
>>>>> "robert" == Robert Collins <robertc at robertcollins.net> writes:
robert> This updates my docs on inventory with an analysis of what a hash trie
robert> based inventory might look like/perform like.
Nice work. Even it not final, that's a significant step, so:
BB:tweak (see typos)
Vincent
<snip/>
robert> === modified file 'doc/developers/inventory.txt'
robert> --- doc/developers/inventory.txt 2008-01-03 01:40:16 +0000
robert> +++ doc/developers/inventory.txt 2008-08-22 04:46:46 +0000
<snip/>
robert> @@ -39,9 +41,269 @@
robert> All the xml serialized forms write to and read from a single byte string, whose
robert> hash is then the inventory validator for the commit object.
<snip/>
robert> +Design elements to achieve the goals in a future inventory implementation
robert> +-------------------------------------------------------------------------
<snip/>
robert> +
robert> + * Item_keys_introduced_by is innately a history-using function; we can
robert> + reproduce the text-key finding logic by doing a tree diff between any tree
robert> + and and older tree - that will limit the amount of data we need to process
s/and and/and an/
<snip/>
robert> +
robert> +Hash bucket based inventories
robert> +=============================
robert> +
robert> +Overview
robert> +--------
robert> +
robert> +We store two maps - fileid:inventory_entry and path:fileid, in a stable
robert> +hash trie, stored in densly packed fragemnts.. We pack keys into the map
s/fragemnts/fragments/
<snip/>
robert> +Goal satisfaction
robert> +-----------------
robert> +
robert> + 1. Success
robert> + 2. Success
robert> + 3. Success
robert> + 4. Success, though each change will need its parents looked up as well
robert> + so it will be proportional to the changes + the directories above
robert> + the changed path.
robert> + 5. Success - looking at the difference against all parents we can determine
robert> + new keys without reference to the repository content will be inserted
robert> + into.
robert> + 6. This probably needs needs a path->id map, allowing a 2-step lookup.
s/needs needs/needs/
<snip/>
robert> +Canonical form
robert> +--------------
robert> +
<snip/>
robert> +The path and content maps are populated simply by serialising every inventory
robert> +entry and inserting them into both the path map and the content map. The maps
robert> +start with just a single leaf node with an empty prefix. When a leaf nodes size
robert> +exceeds the fragment size it is split into N smaller leaf nodes with a longer
robert> +prefix. The prefixes are the smallest that can be used to create smaller leaf
robert> +nodes. Often this will be one octet (say from AB to two nodes with ABC and ABE
robert> +respectively). When an internal node exceeds the size threshold the same
robert> +process is applied. When removing a node from a from a leaf, the size of the
s/from a from a/from a/
robert> +pther children pointed at by the containing internal node is checked - if the
s/pther/other/
Have you considered handling the inventory as a flat file-ids list
instead of a path tree ?
This could keep the nice properties of hash tries while
addressing the 10.000 files in one directory corner case as well
as the 'Goal satisfaction point 4' above.
Vincent
More information about the bazaar
mailing list