split-inventories, per-directory design sketch

Sun Nov 9 23:23:23 GMT 2008

So, I've been thinking about what it would take to do a per-directory
split that would work efficiently with the api's we have - because if we
have to change a lot to make a new system work we're basically rewriting
the entire stack and thats not good :).

I'll say up front I think this is not as good (in terms of clarity,
potential performance, etc) as the current effort - but its worth
considering how the approach would look anyhow.

If we define a directory as a dict from name to inventory entries, and
we attach to each inventory entry an oracle (e.g. a bloom filter) for
the fileids of its children, then we can avoid reading an entire tree to
find a file id, or determine a miss. This does require scanning the
directory - so as directory size grows, the efficiency will shrink. If
the oracle isn't updatable on deletes and adds of children, then updates
will require scanning a full subtree, which is also a source of scaling
inefficiency. And finally we'd need to consider the size of the oracles.

There are variations - we could have an oracle that maps file ids to
possible subdirectories (e.g. by having a dict mapping the first N (say
64) bits of the hash of fileids of children of this directory to a list
of subdirectory names). This would be updatable, but scales linearly
with the size of the subtrees - so the root node would be getting quite
sizable.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20081110/1165778e/attachment.pgp