[MERGE] repository docs (cleaner still) and GraphIndex

Sat Jul 14 00:04:10 BST 2007

On Fri, 2007-07-13 at 18:40 -0400, Aaron Bentley wrote:
> Robert Collins wrote:
> > On Sat, 2007-07-14 at 01:09 +1000, Robert Collins wrote:
> >> This merge adds the repository docs writeup, with the index stuff I had
> >> in it cleaned up and the generic parts factored out to a new developer
> >> text file. 
> >>
> >> I've also added a GraphIndex - a graph aware index which is 20% smaller
> >> than our knit indices today (at least for me on my repositories).

Thanks for the review. I appreciate the thoroughness with which you have
reviewed my grammar. I feel a little tweaked by it too - are all the
corrections unambiguously better?

> So this is a toy implementation then?  I'd expect more in terms of
> random access if it were our final implementation.

Its not linked into any repository format, so yes, the index engine is
rather toy, but the API is solid - I'm exercising it now to be sure.

> I'd also want more discussion of the physical file format.

> In fact, it seems like it would be trivial to convert this to use the
> container format.

Possibly. I could not see an advantage to using the container format for
this because:
 - the data is homogenous
 - the names field is useless for an index
 - theres no need to address each index record by anything other than
bytes.
 - and most importantly we'd be adding a requirement to support
arbitrary sync-up with the record stream which does not exist today and
that would affect the container implementation significantly by forcing
it to escape content within each record and translate on reads. Unless
we continued using byte offsets, in which case:

>   Instead of byte offsets, you'd use record numbers,
> which can be precalculated.

Uhhh, I don't see that this is true unless you mean that you'd write the
record number in a header in the pack record; then look for that. If
thats what you mean, see above about sync-up of the record stream.

> > +    def __init__(self, indices):
> > +        """Create a CombinedGraphIndex backed by indices.
> > +
> > +        :param indices: The indices to query for data.
> > +        """
> > +        self._indices = indices
> 
> ^^^ Perhaps a note that the order of indices can affect performance
> dramatically?

I think this should go in the class docstring, something like:
Queries within the combined index will be made against the first index,
and then the second and so on. The order of index's can thus influence
performance significantly. For example, if one index is on local disk
and a second on a remote server, the local disk index should be before
the other in the index list.

In the __init__ docstring I propose to clarify:
An ordered list of indices to query for data.

> > +    def iter_all_entries(self):
> > +        """Iterate over all keys within the index
> > +
> > +        :return: An iterable of (key, reference_lists, value). There is no
> > +            defined order for the result iteration - it will be in the most
> > +            efficient order for the index.
> > +        """
> > +        seen_keys = set()
> > +        for index in self._indices:
> > +            for node in index.iter_all_entries():
> > +                if node[0] not in seen_keys:
> > +                    yield node
> > +                    seen_keys.add(node[0])
> 
> ^^^ Don't you need to define the results of duplicate keys?  Are the
> "references" values meaningful when you don't know which index they came
> from?

By define do you mean 'check that duplicate keys have the same value' ?
references are always meaningful because they are returned to the caller
as keys, not as the internal byte offsets.

> > +
> > +    def iter_entries(self, keys):
> > +        """Iterate over keys within the index.
> > +
> > +        :param keys: An iterable providing the keys to be retrieved.
> > +        :return: An iterable of (key, reference_lists, value). There is no
> > +            defined order for the result iteration - it will be in the most
> > +            efficient order for the index.
> > +        """
> > +        keys = set(keys)
> > +        for node in self.iter_all_entries():
> > +            if node[0] in keys:
> > +                yield node
> 
> 
> You can terminate early if all desired keys have been seen.

Good catch, thanks. Really though we should have bisection or something
here :).

> > +While different indices will likely require different interfaces, we
> > +should keep these consistent to encourage easy adaption and reuse between
> > +indices.
> 
> This seems a bit contradictory.

I mean to say that while we may have many graph types - GraphIndex,
Index, WhackyIndex - we should key the various methods as similar as
possible - e.g.:
GraphIndexBuilder - add_node(key, references, value)
IndexBuilder - add_node(key, value)
WhackyIndexBuilder - add_node(key, whackiness, value)

Is that clearer ?

> > +Revert              Revision graph access, Inventory extraction, file text
> > +                    access.
> 
> What's the revision graph access for?  Revno to revision id translation?
> Is inventory extraction necessary?  With fast comparison of historical
> trees, we should not need it.

I thought revert was based around 'change to be like revision X', not a
delta ? What should Revert state here?

> > +Log                 Revision graph (entire at the moment) access,
> > +                    sometimes status between adjacent revisions. Log of a
> > +                    file needs per-file-graph.
> 
> With the way dotted revnos are defined, aren't we stuck getting the
> whole graph forever?

With the use of dominators I believe we can do partial-graph work for
log.

> 
> > +Missing             Revision graph access.
> 
> What about revision log access?

Thanks.

> > +Ideally we can make our data access for commands such as branch to
> > +dovetail well with the native storage in the repository, in the common
> > +case. Doing this may require the commands to operate in predictable
> > +manners.
> 
> Can you say more?

If we have two commands which could use the same access pattern, or
could use a different one, we may get a win by coercing them to the same
access pattern but optimising it heavily. Its a little speculative, thus
the 'may' in my prose.

> > +=========================================== =========
> > +Pattern                                     Commands
> > +=========================================== =========
> > +Single file text                            annotate, diff
> > +Files present in one revision               branch
> > +Newest form of files altered by revisions   merge, update?
> > +Topological access to file versions/deltas  annotate-uncached
> > +Stream all data required to recreate revs   branch (lightweight)
> > +Stream file texts in topological order      bundle
> 
> ^^^ how is this different from branching?

branching does not require topological order - generating mp_diffs does.

> > +Its important to note that knit repositories cannot be regenerated by
> > +scanning .knits, so a mapped index is still irreplaceable and must be
> > +transmitted on push/pull.
> 
> ^^^ Or it may be combined with other data, as in bundle format 4.

Of course. The context here is talking purely about using a mapped index
as described in the same section; if we consider moving data to the
knits or elsewhere, then the mapped index would not be as described and
that caveat won't apply.

Thanks,
Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070714/d6ed7c9d/attachment.pgp