[MERGE] Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation.

John Arbash Meinel john at arbash-meinel.com
Thu Aug 28 17:29:07 BST 2008



Vincent Ladeuil wrote:
>>>>>> "robert" == Robert Collins <robertc at robertcollins.net> writes:
> 
> <snip/>
> 
>     robert> Canonical form isn't delta based at all;
> 
> Maybe I misspoke then; what I meant is that the full
> representation is stored as fragments on disk somehow. Accessing
> all the fragments may have a cost. This cost can be reduced by
> putting all the fragments at the same place, eventually
> *duplicating* the same information. As long as the representation
> is canonical this should not be a problem.

As near as I can tell, this is "bzr pack". Specifically, there is no plan to
ever write "one-big-page" with all file-ids stored in a linear sequence.

At best, we would write:

ROOT_PAGE
  internal
  subtree1: HASH1
  subtree2: HASH2
HASH1 PAGE
  internal
  subtree3: HASH3
  ...
HASH2 PAGE
  ...
HASH3 PAGE
 leaf
 file_id X, info
 file_id Y, info
HASH4 PAGE
 ...

So the data would *still* be divided into fragments; they just happen to sit
next to each other. However, since the references are not known at the "index"
layer, you would still need to read the ROOT_PAGE to find the subtree pages,
then read those pages to find the next level, and so on.
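To make the cost concrete, here is a toy page store keyed by content hash
(the serialization format and names are invented for illustration, not bzr's
actual format). Even with pages adjacent on disk, reaching a leaf takes one
read per level, because each child CHK is only discovered by reading its
parent page:

```python
import hashlib

def chk(payload: bytes) -> str:
    """Content hash key (CHK) for a page -- SHA-1 here, purely illustrative."""
    return hashlib.sha1(payload).hexdigest()

# Build the pages bottom-up: the leaf first, then internal pages that
# reference children by CHK, mirroring the ROOT_PAGE/HASH1/HASH3 layout above.
store = {}
leaf = b"leaf\nfile_id_x info\nfile_id_y info\n"
store[chk(leaf)] = leaf
hash3 = chk(leaf)
internal = ("internal\nsubtree3: %s\n" % hash3).encode()
store[chk(internal)] = internal
hash1 = chk(internal)
root = ("internal\nsubtree1: %s\n" % hash1).encode()
store[chk(root)] = root

# Walk from the root down: one read per level.
reads = 0
page = store[chk(root)]
reads += 1
while page.startswith(b"internal"):
    child_chk = page.split(b": ", 1)[1].strip().decode()
    page = store[child_chk]
    reads += 1
```

Three pages, three reads, regardless of physical placement.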

One thing that concerns me about Robert's scheme is just how much data gets
written to disk on a commit. If you change an entry in the second level, you
obviously have to write out that page, and then the root page.
So with 4k pages, every commit writes at least 8k.

The *really really* good thing is that we never have a "fulltext": whatever
didn't change is already written to disk and is simply reused.

The downside is that 8kB * 19k revisions (for bzr.dev) = 149 MB. Right now my
whole bzr pack file is 77MB, and I just checked that the inventory is 38.8MB
of it. I already felt that inventory being 50% of the storage was wasteful;
I really think inventory management should be under 25% of total storage.
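The arithmetic behind those numbers, as a quick sanity check (4KiB pages, two
pages written per commit, and ~19k revisions are all taken from the figures
above):

```python
PAGE_SIZE = 4 * 1024          # 4KiB pages, per the scheme above
PAGES_PER_COMMIT = 2          # one second-level page plus the root page
REVISIONS = 19_000            # roughly bzr.dev at the time

total_bytes = PAGE_SIZE * PAGES_PER_COMMIT * REVISIONS
total_mib = total_bytes / 2**20   # ~148.4 MiB, i.e. the ~149 MB quoted above
```

Note this is a floor: a change deeper than the second level writes one page
per affected level, so commits can cost more than 8k.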

Robert mentioned that we may end up hinting to the storage layer, so that it
can delta-compress these pages. He also mentioned that it really needs to be
considered as part of the whole process, but I haven't specifically read it
in his documentation.

Consider: if the CHK is the standard handle to these pages, there really is no
information about which pages should be stored together to get good
compression. Now, if we can use whatever prefix determines the page layout
(file_id, hash(file_id), path), that would at least give us some hints. Just to
make sure I understand correctly... hash(page) is the key used to find a page,
but to determine where an inventory entry is located we use hash(file_id).

How about an example lookup:

Find the InventoryEntry for file_id = "j-file-id" in revision "a-rev-id"

1) compute hash_file_id = hash('j-file-id')
2) Read the "root node" for "a-rev-id", to get the CHK for the content_map
3) Read the root page of the content map
4) compare hash_file_id to root_page.hash_prefix (which for the root is
probably empty?)
5) compare the remaining bits of hash_file_id to each prefix listed in the
node. Find the CHK which corresponds to the next bits of hash_file_id
6) Read the corresponding page. If it is an internal node, repeat step 5. If
it is a leaf node, look for (hash_file_id NULL file_id VALUE).
7) Deserialize VALUE to get the InventoryEntry
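Steps 1-7 can be sketched with in-memory dicts standing in for pages. The
page structure, single-nibble prefixes, and all names here are assumptions
for illustration, not bzr's actual format:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

# A leaf page maps the full hash_file_id to (file_id, serialized value);
# an internal page maps a hash prefix to the CHK of the child page.
store = {}
def put(page) -> str:
    key = h(repr(page).encode())
    store[key] = page
    return key

hash_file_id = h(b"j-file-id")                       # step 1
leaf_chk = put(("leaf", {hash_file_id: ("j-file-id", "VALUE")}))
root_chk = put(("internal", {hash_file_id[:1]: leaf_chk}))

def lookup(content_map_root: str, hfid: str):
    kind, body = store[content_map_root]             # step 3: read root page
    depth = 0
    while kind == "internal":                        # steps 4-6: follow prefixes
        depth += 1
        kind, body = store[body[hfid[:depth]]]
    file_id, value = body[hfid]                      # step 6: probe the leaf
    return value                                     # step 7: deserialize
```

Steps 2 and 3 (resolving "a-rev-id" to the content_map root) are elided; the
walk from root to leaf is the part that scales with tree depth.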

If we then looked for the InventoryEntry of path = "a/valid/path" in revision
"a-rev-id", it would take more steps:

1) Read the "root node" for "a-rev-id", to get the CHK for the content_map
root and the path_map root.
2) Read both pages, because we know we will need them both.
3) Start following the prefixes in the path_map pages, until you get to a
terminal entry. It isn't stated whether the path_map trie will be split up on
hash(path) or on path. It *sounds* like hash(path), but IMO that may have bad
locality if we want to make it easy to look up multiple neighboring paths.
Testing needed.
4) Once you have the file_id for the given path in the given revision, look up
the InventoryEntry via file_id (starting at (4) above).
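The two-stage lookup amounts to chaining the two maps. A minimal sketch with
flat dicts standing in for the CHK-paged tries (whether the path_map keys on
hash(path) or raw path is exactly the open question above; hash(path) is
assumed here):

```python
import hashlib

def h(s: str) -> str:
    return hashlib.sha1(s.encode()).hexdigest()

# Hypothetical flattened maps; the real structures would be paged tries
# reached via the revision's root node, as described in the steps above.
path_map = {h("a/valid/path"): "j-file-id"}
content_map = {h("j-file-id"): "InventoryEntry(j-file-id)"}

def entry_by_path(path: str) -> str:
    file_id = path_map[h(path)]        # steps 1-3: walk the path_map
    return content_map[h(file_id)]     # step 4: file_id lookup as before
```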


  --- A use case here is doing "bzr log 'path/*'" and wanting to find all
      revisions which may have modified any records under that path.
      This is problematic with hash(path) because you need to find the
      directory file_id, map over to its inventory entry, and from there find
      all child file_ids. I suppose in the common case the paths aren't
      changing, so the
      CHK for the page containing hash(directory path) wouldn't be changing.
      If a file was added to a subdir, I'm not convinced that the CHK
      corresponding to the page for hash(directory_file_id) would change.
      I suppose CHK for hash(subdir_file_id) would (since it has a new
      record.)
      Now, if a file was only modified, CHK of the page containing
      hash(subdir_file_id) may not change. However CHK for the page containing
      hash(file_id) *would*.
      So it would seem that if you keep the set of all file ids currently
      under the given path, and monitor them for changes (add/remove) you can
      update that set as necessary, and then monitor the CHK of pages
      corresponding to hash(file_id) (for all file_ids). And if they don't
      change, you don't need to probe further.

      So I *think* you can still do "bzr log path/*" more efficiently than
      brute force, though it is not *nearly* as obvious as in a
      split-by-directory tree.
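The monitoring idea in that aside, keeping the set of file_ids under the path
and comparing per revision the CHK of the page holding each hash(file_id),
could look like this (the per-revision page-CHK maps are faked here; really
they would come from walking each revision's content_map):

```python
def changed_file_ids(file_ids, page_chk_old, page_chk_new):
    """file_ids whose containing page's CHK differs between two revisions."""
    return {fid for fid in file_ids
            if page_chk_old.get(fid) != page_chk_new.get(fid)}

# Fake "CHK of the page containing hash(file_id)" maps for two revisions:
# only file-b's containing page changed, so only file-b needs further probing.
rev1 = {"file-a": "chk-1", "file-b": "chk-2"}
rev2 = {"file-a": "chk-1", "file-b": "chk-9"}
touched = changed_file_ids({"file-a", "file-b"}, rev1, rev2)
```

Unchanged pages hash identically, so an equal CHK rules out changes to every
entry in that page without reading it.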


> 
> Hmm. Just to be clear, in the example you gave:
> 
>     robert> Ok, so - we start at the root node, and deserialise
>     robert> it. This gives us a list of child node CHK's, as
>     robert> pointers in a tree. e.g.
> 
>     robert> '0' : HASH1
>     robert> '1' : HASH2
>     robert> '2' : HASH3
> 
>     robert> We then read the other root node, which might give us:
>     robert> '0' : HASH1
>     robert> '1' : HASH4
>     robert> '2' : HASH5
> 
> Is the tree associated with HASH1 stored once and shared between
> the two trees or stored for each tree ?
> 
>     Vincent

HASH1 identifies a single block of bytes on disk, so yes, it is shared by
both trees.
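The sharing falls directly out of content addressing: identical bytes hash to
the same key, so the store holds one copy no matter how many roots reference
it. A toy illustration (page contents are made up):

```python
import hashlib

def chk(payload: bytes) -> str:
    return hashlib.sha1(payload).hexdigest()

store = {}
def put(payload: bytes) -> str:
    key = chk(payload)
    store[key] = payload          # idempotent: same bytes land in the same slot
    return key

hash1 = put(b"subtree-0 contents")   # the unchanged '0' subtree
root_a = put(b"0:" + hash1.encode() + b" 1:HASH2 2:HASH3")
root_b = put(b"0:" + hash1.encode() + b" 1:HASH4 2:HASH5")
# Two distinct roots plus one shared subtree block: three pages total.
```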

John
=:->


