Single-keyspace APIs (was Re: RFC: versionedfile overhaul)

Mon Mar 17 23:36:40 GMT 2008

On Mon, 2008-03-17 at 08:14 -0400, Aaron Bentley wrote:
> Robert Collins wrote:
> > Aaron and I agreed on a cautious first step, which is to change the keys
> > used in the VersionedFile interface from strings to tuples of strings;
> > this will be used to create a single 'Knit' for the file texts in a
> > repository, rather than one knit per file-id (with a key for a single
> > text such as (fileid, versionid)). Aaron would like to move to a single
> > key-space such as ('text', fileid, versionid); however I think this is
> > significantly harder to do due to the current index layer (and I'm not
> > convinced a single keyspace is really a good idea, but that's a different
> > discussion - we are both agreed that a single keyspace for all file
> > texts _is_ good).
> 
> I think that a single namespace for keys is the right thing.  If we're
> going to go through the pain of migrating to new APIs, let's make them
> the right APIs so that we don't have to go through more pain later.
> 
> Many of our operations require access to multiple datatypes at once:
> 
> - fetch requires access to all repository datatypes
> - generating a bundle requires access to all repository datatypes
> - log -v requires access to revisions, inventories and the revision graph
> - status requires access to inventories and revisions and the revision graph
> - merge requires access to inventories and files and the revision graph
> (and lca merge uses the per-file graph)
> - diff requires access to inventories and files
> - upgrade requires access to all repository datatypes
> 
> If we don't have a single keyspace, we will have, in the worst case, 4x
> as many roundtrips as are necessary to perform the operations.

We don't have a single keyspace in our indices. Assuming one readv per
pack per datatype, we spend many more roundtrips figuring out where to
read data from the pack, because index lookups require traversing a
tree based on the key.

Combining all the keys into one index makes the index bigger. Because we
want different data attached to each kind of key, it also makes the
index logic more complex, and the extra data leads to more round trips
during index queries. I'm quite convinced we would hurt performance if
we did this at all naively.

> This is API friction, because packs can certainly satisfy the request
> using a single roundtrip for all types-- it's our APIs that would
> prevent it.

Given the caveat above of using indices, sure. I don't mind a constant
overhead though: it doesn't get worse as the history or tree size
increases.
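
To illustrate the constant-overhead point, here is a minimal Python
sketch. It assumes one readv per pack per datatype, as above, and
ignores index lookups entirely; the function name and the numbers are
hypothetical, not bzrlib code.

```python
def data_roundtrips(num_packs, num_datatypes, unified_keyspace):
    """Count data readv roundtrips under the simplifying assumption of
    one readv per pack per datatype (index queries are not counted)."""
    # With a unified keyspace, one readv per pack can cover every datatype.
    readvs_per_pack = 1 if unified_keyspace else num_datatypes
    return num_packs * readvs_per_pack


# Hypothetical repository: 3 packs, 4 datatypes.
separate = data_roundtrips(3, 4, unified_keyspace=False)
unified = data_roundtrips(3, 4, unified_keyspace=True)
ratio = separate // unified
```

Under these assumptions the ratio depends only on the number of
datatypes, not on the number of packs or the amount of history.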

> A unified keyspace would mean that every repository record could carry
> its unique name.  This would move towards our goal of making indices an
> optimization only.

I don't want to make indices an optimisation only; what I want to do is
to make sure they are completely regeneratable from a .pack. This is IMO
quite different: I would expect that given an indexless .pack we would
scan and generate indices before doing any other operations.

> Bundle format 4 uses a unified keyspace, and it has worked out quite
> well.  So there's every reason to believe it would work out well for
> other pack formats.
> 
> So I think the time for a unified keyspace is now.

The tuple-based keys we've agreed to use are easily extended to use a
datatype key prefix, with no API changes related to keys. So I don't
think there would be much rework to go the whole hog here in future if
we just go to tuples today.
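
A minimal Python sketch of what I mean, assuming keys are plain tuples
of strings; the helper names and the example ids are hypothetical and
not part of the VersionedFile API:

```python
def with_datatype(datatype, key):
    """Return a single-keyspace key by prefixing a tuple key with a datatype."""
    return (datatype,) + tuple(key)


def split_datatype(key):
    """Split a prefixed key back into (datatype, original key)."""
    return key[0], key[1:]


# Today: a key for a single file text is (fileid, versionid).
text_key = ("file-id-1", "rev-id-1")

# Later: the same key in a single keyspace gains a datatype prefix.
unified_key = with_datatype("text", text_key)

# Code that treats keys as opaque tuples (dict lookups here) is unaffected.
records = {unified_key: "record for the file text"}
datatype, original_key = split_datatype(unified_key)
```

Anything that only passes keys around or uses them for lookups sees a
longer tuple and nothing else.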

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.