help wanted: fast indexing core - no bzrlib experience needed

Fri Jul 27 15:36:28 BST 2007

With the latest patch from my index branch, I think I have the full
index api needed to deliver a pack based repository complete. (Well
there are a couple of trivial adapters to go; but they use the api, they
don't change it).

I'm focused on the large scale changes - breaking the correspondence
between 'versionedfile' in the repository api and physical storage and
delivering a 'pack based' repository via that. This repository
refactoring will hopefully be followed up with a number of the deeper
structural changes mooted in London, which our current storage api
prohibits us from exploring.

There is a significant body of work to create a fast and efficient
implementation, better than the toy index I have delivered, which is
needed to realise the potential of this approach, both as a trivial
'better implementation of the same data model', and in its final form
where we e.g. split inventories into much smaller pieces, store multiple
variations of a commit, and other such alterations.

Specifically we need 3 different index types to operate approximately as
efficiently as knit indices do, but without reading the entire index
unconditionally. I have put in place a toy implementation of these types
by having one parameterisable class; but often a tuned implementation is
better. The three combinations we use are:
 - a plain key:value index for signatures
 - a graph index - key:value with 1 node reference list for revisions
and inventories
 - a graph index key:value with 2 node reference lists for file texts

What we want to reach is an implementation of the graph indices, which
are the most heavily used today, where:
 - recent data can be accessed without accessing the entire index
 - ideally single node lookup can be done without a linear scan (but if
it can't, it should not pay the linear scan twice for nodes processed by
the scan)
 - robust against windows line ending issues, or if not robust at least
have any such corruption reversible in some programmatic way.

We also need to solve challenges with combining data from multiple graph
indices such that questions like 'heads' can be answered without having
to read all of every index in the worst case. This may require extending
the api created thus far, or at least some private extensions.

So this is a call for volunteers. I am happy to offer ideas about how I
would implement such a core, but fully expect that any volunteer
stepping up will have their own concepts, and be interested in stepping
in a profiling and benchmarking to make their solution seriously fast!
In other words, if you need advice, I have it ready :), but if you think
you have the answer, I'm happy not to burden you - please get coding :)

Almost *no* bzrlib experience is needed: The indexing layer lives right
down at the bottom of the stack: the only layer it utilises is the
Transport layer, a trivial VFS interface where you can write, open (and
read), or readv, from files.

Hoping to hear from you soon,
Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070728/84b05a3b/attachment.pgp