[MERGE] Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation.
Robert Collins
robertc at robertcollins.net
Tue Sep 2 06:11:15 BST 2008
Meta-note: I have had no bb:COMMAND message for this patch, so it's still
outstanding.
On Thu, 2008-08-28 at 11:29 -0500, John Arbash Meinel wrote:
> One thing which concerns me with Robert's scheme is just how much data
> gets written to disk on a commit. If you change an entry in the second
> level, you obviously have to write out that page, and then the root page.
> So if we use 4k pages, that means every commit is 8k.
Right. However, only one line in each of those pages is changed, so it's
2 lines of unique text and 7.9K of duplicate text.
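To make that concrete, here is a minimal toy sketch (not bzr's actual implementation) of a two-level hash trie: leaf pages group entries by a prefix of hash(file_id), and the root page holds the CHK of each leaf. Changing one entry rewrites exactly one leaf page plus the root, while the bulk of each page's content is unchanged between revisions:

```python
import hashlib

def h(key):
    # Toy content hash used both for bucketing and for page CHKs.
    return hashlib.sha1(key.encode()).hexdigest()

def build_trie(entries):
    # Leaf pages: entries grouped by the first hex digit of hash(file_id).
    pages = {}
    for file_id, value in entries.items():
        pages.setdefault(h(file_id)[0], {})[file_id] = value
    # Root page: maps each prefix to the CHK of its leaf page.
    root = {prefix: h(repr(sorted(page.items())))
            for prefix, page in pages.items()}
    return root, pages

entries = {'file-%d' % i: 'old-rev' for i in range(100)}
root_a, pages_a = build_trie(entries)

entries['file-42'] = 'new-rev'            # one commit touches one entry
root_b, pages_b = build_trie(entries)

changed = [p for p in pages_b if pages_a[p] != pages_b[p]]
print(len(changed), root_a != root_b)     # 1 True: one leaf plus the root
```

So a single-entry commit writes two pages, but the new pages are near-duplicates of the old ones, which is what makes delta compression attractive here.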
> The downside is that 8kB * 19k revisions (for bzr.dev) = 149 MB. Right
> now my whole bzr pack file is 77MB, and I just did checks to see that
> inventory is 38.8MB of it. And I felt that inventory being 50% of the
> storage was wasteful. I really feel like inventory management should be
> under 25% of total storage.
It's not an apples-to-apples test. What's the uncompressed size of all
the inventories? :).
> Robert mentioned that we may end up hinting to the storage layer, so
> that it can delta-compress these pages. He also mentioned that it really
> needs to be considered as part of the whole process, but I haven't
> specifically read it in his documentation.
We need the layers to work together well, but each layer also needs to
make sense on its own; so it's not part of this stage of my design
process, but it is a blocker for me finishing group compress - we need to
have a really good handle on how these things interact.
> Consider if the CHK is the standard handle to these pages, there really
> is no information about which pages should be put together to get good
> compression. Now, if we can use whatever prefix we use to determine the
> page layouts (file_id, hash(file_id), path) that would at least give us
> some hints.
>
> Just to make sure I understand correctly... the hash(page) is the key to
> find a page. But to determine where an inventory entry is located we use
> hash(file_id).
Yes, that's the sort of thing bubbling around my head. We could also use
the root file_id as a key, so pages for the same prefix would have a
similar storage hint.
> How about an example lookup:
>
> Find the InventoryEntry for file_id = "j-file-id" in revision "a-rev-id"
>
> 1) compute hash_file_id = hash('j-file-id')
> 2) Read the "root node" for "a-rev-id", to get the CHK for the content_map
> 3) Read the root page of the content map
> 4) compare hash_file_id to root_page.hash_prefix (which for the root is
>    probably empty?)
Yes, it is - that's explicit in the docs, I believe.
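The numbered steps above can be sketched as code. This is a hedged toy model, not bzr's implementation: pages live in a dict keyed by their CHK, internal pages carry a `hash_prefix` (empty at the root) and a `children` map, and leaf pages map file_id to an entry tuple. All names here are illustrative:

```python
import hashlib

def chk(page):
    # Toy content hash key for a page.
    return hashlib.sha1(repr(page).encode()).hexdigest()

store = {}                                 # CHK -> page content

def put(page):
    store[chk(page)] = page
    return chk(page)

# Build a tiny content_map: one leaf page under an empty-prefix root.
hash_file_id = hashlib.sha1(b'j-file-id').hexdigest()
leaf = put({'j-file-id': ('j-file-id', 'file', 'text-sha')})
root_chk = put({'hash_prefix': '',         # empty at the root, as above
                'children': {hash_file_id[:1]: leaf}})
revision_roots = {'a-rev-id': root_chk}    # step 2's per-revision index

def lookup(rev_id, file_id):
    hash_fid = hashlib.sha1(file_id.encode()).hexdigest()   # step 1
    page = store[revision_roots[rev_id]]                    # steps 2-3
    while 'children' in page:                               # step 4: descend
        depth = len(page['hash_prefix']) + 1
        page = store[page['children'][hash_fid[:depth]]]
    return page[file_id]

print(lookup('a-rev-id', 'j-file-id'))
```

Each loop iteration consumes one more character of the hash, matching the idea that a page's `hash_prefix` grows as you descend the trie.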
...
> 3) Start following the prefixes in the path_map pages, until you get to
> a terminal entry. It isn't stated whether the path_map trie will be
> split up on hash(path) or on path. It *sounds* like hash(path), but IMO
> that may have bad locality if we want to make it easy to look up
> multiple neighboring paths. Testing needed.
I think you could key on either hash(path) or path, but path keys give
you 'ls' whereas hash(path) keys won't - or else the storage of a
directory entry needs to explicitly include its children. I'm leaning
towards path keys as the best answer.
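The locality argument can be shown with a few sorted keys (a hedged illustration with made-up paths): with path keys, the children of a directory are adjacent in key order, so an 'ls' is one contiguous range scan; sorting the same paths by hash(path) generally scatters them across the key space:

```python
import hashlib

paths = ['doc/a.txt', 'doc/b.txt', 'doc/c.txt', 'src/x.py', 'src/y.py']

by_path = sorted(paths)
by_hash = sorted(paths, key=lambda p: hashlib.sha1(p.encode()).hexdigest())

# With path keys, 'ls doc/' is a contiguous prefix range of the key space.
doc_listing = [p for p in by_path if p.startswith('doc/')]
print(doc_listing == by_path[:3])      # True: the directory is one range
# With hash(path) keys the same entries typically land in unrelated pages,
# so listing a directory would need one probe per child.
```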
> 4) Once you have the file_id for the given path in the given revision,
> look up the InventoryEntry via file_id (starting at (4) above).
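The two-phase lookup being described - path_map resolves a path to a file_id, then the content_map yields the InventoryEntry for hash(file_id) - composes as follows. A minimal sketch with flat dicts standing in for the tries; all names and entry fields are illustrative:

```python
import hashlib

path_map = {'doc/a.txt': 'a-file-id'}                  # path -> file_id
content_map = {hashlib.sha1(b'a-file-id').hexdigest():
               ('a-file-id', 'file', 'text-sha')}      # hash(fid) -> entry

def entry_by_path(path):
    file_id = path_map[path]                           # path_map phase
    key = hashlib.sha1(file_id.encode()).hexdigest()   # content_map phase
    return content_map[key]

print(entry_by_path('doc/a.txt'))
```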
...
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.
More information about the bazaar mailing list