[MERGE] Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation.

Thu Aug 28 13:10:46 BST 2008

On Thu, 2008-08-28 at 10:26 +0200, Vincent Ladeuil wrote:
> 
> IIUC the inventories will now be stored fragmented. In most of
> the cases this will allow us to find differences without
> rebuilding full in-memory representations. Instead, we will read
> the needed parts as needed.

Right.

> So we are back to a mix a full representation/delta
> representations. One question is when to we store full
> representation knowing that our canonical form *guarantees* that
> this is equivalent to applying all the deltas from the empty tree
> of the NULL_REVISION. This can be exercised against existing
> repositories to find the better speed/space tradeoff or find the
> relevant parameters (HP size, full representation frequency,
> etc).

Canonical form isn't delta based at all; I think the penny hasn't quite
dropped all the way into the slot :). Also note that we're not using a
hash partition, there are no delta representations at all. Its history
independent (which is one of the goals) - anything that ever writes
deltas isn't history independent. 

> From there, well, make it work, right, fast :D
> 
> A bit of effort will be needed to instrument the code or write
> adequate scripts (measuring time and repository sizes or any
> more precise points we can think of) against the existing
> repositories so that we can measure improvements.

Sure.

> Do we have some knowledge about how long it takes to replay all
> commits starting from scratch on bzr, openoffice and mysql
> repositories ? If this is too long, can we find a satisfying
> number of revisions to start with (incrementing it as we get
> better performances) ?

well, we don't need to do that to determine behaviour on existing repos,
only the inventory data is needed.

> Also regarding tree representation, it's worth noting that using
> file-ids or paths as keys may not be relevant. Hashes will spread
> differences anyway, different hashes will spread
> differently. This is really a key point about which I'm pretty
> sure we can't draw conclusions without experiences (and I really
> mean it, I had a lot of surprises in my previous experiences).

well, a plain radix tree approach isn't hashing; for paths that may be
better. Stilll working on analysing that.

> In other words, we must first reach the work/right point and
> experiment various hashes (and other details :) to solve the fast
> part.
> 
> That's why, IMHO, no, IMVHO, it may be more important to
> highlight the explanations you gave above than deciding if hash
> tries or radix trees are more appropriate here.

I'll see about working them in.

> P.S.: If it wasn't clear enough, I'd be more than happy to work
> on that subject in any way :D

Cool  :)

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080828/7d7d2819/attachment.pgp