[MERGE] repository docs (cleaner still) and GraphIndex

Fri Jul 13 23:40:47 BST 2007

Robert Collins wrote:
> On Sat, 2007-07-14 at 01:09 +1000, Robert Collins wrote:
>> This merge adds the repository docs writeup, with the index stuff I had
>> in it cleaned up and the generic parts factored out to a new developer
>> text file. 
>>
>> I've also added a GraphIndex - a graph aware index which is 20% smaller
>> than our knit indices today (at least for me on my repositories).

So this is a toy implementation then?  I'd expect more in terms of
random access if it were our final implementation.

I'd also want more discussion of the physical file format.

In fact, it seems like it would be trivial to convert this to use the
container format.  Instead of byte offsets, you'd use record numbers,
which can be precalculated.

> +    def add_node(self, key, references, value):
> +        """Add a node to the index.
> +
> +        :param key: The key. keys must be whitespace free utf8.

"Keys must be whitespace-free"...

> +    Currently successive iter_entries/iter_all_entries calls will read the

^^^ "Currently, ..."

> +    entire index each time. Additionally iter_entries calls will read the

^^^ "Additionally, ..."

> +    def __init__(self, indices):
> +        """Create a CombinedGraphIndex backed by indices.
> +
> +        :param indices: The indices to query for data.
> +        """
> +        self._indices = indices

^^^ Perhaps a note that the order of indices can affect performance
dramatically?

> +    def iter_all_entries(self):
> +        """Iterate over all keys within the index
> +
> +        :return: An iterable of (key, reference_lists, value). There is no
> +            defined order for the result iteration - it will be in the most
> +            efficient order for the index.
> +        """
> +        seen_keys = set()
> +        for index in self._indices:
> +            for node in index.iter_all_entries():
> +                if node[0] not in seen_keys:
> +                    yield node
> +                    seen_keys.add(node[0])

^^^ Don't you need to define the results of duplicate keys?  Are the
"references" values meaningful when you don't know which index they came
from?

> +
> +    def iter_entries(self, keys):
> +        """Iterate over keys within the index.
> +
> +        :param keys: An iterable providing the keys to be retrieved.
> +        :return: An iterable of (key, reference_lists, value). There is no
> +            defined order for the result iteration - it will be in the most
> +            efficient order for the index.
> +        """
> +        keys = set(keys)
> +        for node in self.iter_all_entries():
> +            if node[0] in keys:
> +                yield node

You can terminate early if all desired keys have been seen.

> +bzr is moving to a write-once model for repository storage in order to
> +achieve lock-free repositories eventually. In order to support this we are

"In order to support this, ..."

> +While different indices will likely require different interfaces, we
> +should keep these consistent to encourage easy adaption and reuse between
> +indices.

This seems a bit contradictory.

> +Revert              Revision graph access, Inventory extraction, file text
> +                    access.

What's the revision graph access for?  Revno to revision id translation?
Is inventory extraction necessary?  With fast comparison of historical
trees, we should not need it.

> +Log                 Revision graph (entire at the moment) access,
> +                    sometimes status between adjacent revisions. Log of a
> +                    file needs per-file-graph.

With the way dotted revnos are defined, aren't we stuck getting the
whole graph forever?

> +Missing             Revision graph access.

What about revision log access?

> +Ideally we can make our data access for commands such as branch to
> +dovetail well with the native storage in the repository, in the common
> +case. Doing this may require the commands to operate in predictable
> +manners.

Can you say more?

> +=========================================== =========
> +Pattern                                     Commands
> +=========================================== =========
> +Single file text                            annotate, diff
> +Files present in one revision               branch
> +Newest form of files altered by revisions   merge, update?
> +Topological access to file versions/deltas  annotate-uncached
> +Stream all data required to recreate revs   branch (lightweight)
> +Stream file texts in topological order      bundle

^^^ how is this different from branching?

> +Its important to note that knit repositories cannot be regenerated by
> +scanning .knits, so a mapped index is still irreplaceable and must be
> +transmitted on push/pull.

^^^ Or it may be combined with other data, as in bundle format 4.

> +With non listable transports how should the collection of pack/index files

"non-listable transports, ..."

Aaron