[RFC] set based apis...

Thu Jul 19 03:40:17 BST 2007

On Wed, 2007-07-18 at 21:59 -0400, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Ian Clatworthy wrote:
> > Generally speaking, what you're proposing is great policy. If there is
> > no need to force a constraint on an API, don't do it because it
> > restricts implementation choices.
> 
> Sure it's great in theory, but there are always reasons for enforcing
> these kind of constraints. If you need something in a particular order,
> and you get it in the wrong order, then you have to compensate, by doing
> your own sorting.
> 
> It would be silly, for example, if repositories provided knit deltas out
> of build order.  There's no advantage to storing them out of build
> order, and it's hard to imagine how they could get out of order in the
> first place.  Yet without a guarantee that they will be returned in the
> correct order, the upper layer must enforce that ordering.
> 
> I would suggest that there are at least two categories of data that
> should have an enforced ordering, and this ordering should be reflected
> in the repo storage itself:
> 1. Inventory data should be provided in parent-to-child order
> 2. File versions should be provided in a buildable order.

Well the sort of thing I'm looking at right now is things like the
index. But in sort order this will apply to stacked repositories too. We
may well want out of order responses with in-memory sorting taking place
in the StackedRepository object.

An example using the indices though, as that applies without needing
unwritten things.

We have in the pack-based repository a set of packs, each with an
attached index.

To answer 'Repository.revision_graph' on this we have to query each
underlying index. This can be tuned depending on the index contents, so
thats fine. To answer simpler questions like 'what are the parents of
these versions' we could:
 - query the index object for each version.
   This results in a query on each underlying separate index until each
version has been found. Lets say we want 10 results from 10 indices and
they are distributed equally across them. Then 1 result will need to
search 1 index, 1 will need to search 2 indices, 1 3 and so on. So the
total number of index lookups will be 1+2+3..+10 => (10^2 + 10)/2 = 55
queries to find 10 answers, with 45 of them misses - queries where the
desired version is not present at all.
 - query all indices for all the desired versions in parallel.
   This results in one query on the first index for 10 versions, one on
the second for 9 versions and so forth. None of the queries misses.
   => 10 queries.

So I think its clear here that querying for multiple sets of data is a
win. What about output ordering?

I'm proposing that we have two api's for things like the above. Possibly
implemented as one method with a 'sort_order' parameter, or possibly as 
two methods, one sorted and one not.

Either way, for this index query, the 'sorted' call would in the
CombinedGraph class look something like:

if sort_order is not None:
    versions = tuple(versions) #save the desired order
    result = dict(self.iter_entries(versions))
    for version in versions:
        yield version, result[version]
else:
    # the combining index query logic here.

For clarity - I'm not suggesting that we don't have ordered output *when
it matters*. But theres a lot of code where it doesn't matter.
Particularly graph traversal logic - we will often have N searches going
on in parallel and there is a good win to be had by exposing that down
the stack.

-Rob

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070719/e57f2d2d/attachment.pgp