[MERGE] repository docs (cleaner still) and GraphIndex

Sat Jul 14 00:35:02 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Fri, 2007-07-13 at 18:40 -0400, Aaron Bentley wrote:
>> Robert Collins wrote:
>>> On Sat, 2007-07-14 at 01:09 +1000, Robert Collins wrote:
>>>> This merge adds the repository docs writeup, with the index stuff I had
>>>> in it cleaned up and the generic parts factored out to a new developer
>>>> text file. 
>>>>
>>>> I've also added a GraphIndex - a graph aware index which is 20% smaller
>>>> than our knit indices today (at least for me on my repositories).
> 
> Thanks for the review. I appreciate the thoroughness with which you have
> reviewed my grammar. I feel a little tweaked by it too - are all the
> corrections unambiguously better?

The most ambiguous changes are the commas, but I think they make the
sentences clearer.  With "Currently successive iter_entries", it looks
as though "Currently successive" modifies "iter_entries", but
"Currently" actually modifies "will".

>> In fact, it seems like it would be trivial to convert this to use the
>> container format.
> 
> Possibly. I could not see an advantage to using the container format for
> this because:
>  - the data is homogenous
>  - the names field is useless for an index

Can't you store the key as the name?

>  - theres no need to address each index record by anything other than
> bytes.

Well, from this implementation, it didn't look like you were trying to
do that.

>>   Instead of byte offsets, you'd use record numbers,
>> which can be precalculated.
> 
> Uhhh, I don't see that this is true unless you mean that you'd write the
> record number in a header in the pack record; then look for that. If
> thats what you mean, see above about sync-up of the record stream.

Well, it looked like you were going to read the whole file anyhow.

It's just that we've got so many file formats now, I'd hope we can use
existing formats more and more.

>> ^^^ Perhaps a note that the order of indices can affect performance
>> dramatically?
> 
> I think this should go in the class docstring, something like:

Yes, that's good.

>>> +        for index in self._indices:
>>> +            for node in index.iter_all_entries():
>>> +                if node[0] not in seen_keys:
>>> +                    yield node
>>> +                    seen_keys.add(node[0])
>> ^^^ Don't you need to define the results of duplicate keys?  Are the
>> "references" values meaningful when you don't know which index they came
>> from?
> 
> By define do you mean 'check that duplicate keys have the same value' ?

It wasn't clear to me that they would have the same value, because I
thought the references were numeric at this point.

> references are always meaningful because they are returned to the caller
> as keys, not as the internal byte offsets.

Okay.  Missed that one.

>>> +While different indices will likely require different interfaces, we
>>> +should keep these consistent to encourage easy adaption and reuse between
>>> +indices.
>> This seems a bit contradictory.
> 
> I mean to say that while we may have many graph types - GraphIndex,
> Index, WhackyIndex - we should key the various methods as similar as
> possible - e.g.:
> GraphIndexBuilder - add_node(key, references, value)
> IndexBuilder - add_node(key, value)
> WhackyIndexBuilder - add_node(key, whackiness, value)
> 
> Is that clearer ?

Yeah.

> 
>>> +Revert              Revision graph access, Inventory extraction, file text
>>> +                    access.
>> What's the revision graph access for?  Revno to revision id translation?
>> Is inventory extraction necessary?  With fast comparison of historical
>> trees, we should not need it.
> 
> I thought revert was based around 'change to be like revision X', not a
> delta ? What should Revert state here?

You're correct, but the implementation uses iter_changes.  We should be
able to implement iter_changes against any arbitrary historical tree by
comparing the working tree to the basis, and then comparing the
historical tree to the basis.  I'll do anything I can to avoid O(tree)
operations.

>> With the way dotted revnos are defined, aren't we stuck getting the
>> whole graph forever?
> 
> With the use of dominators I believe we can do partial-graph work for
> log.

That's a relief.

>>> +Ideally we can make our data access for commands such as branch to
>>> +dovetail well with the native storage in the repository, in the common
>>> +case. Doing this may require the commands to operate in predictable
>>> +manners.
>> Can you say more?
> 
> If we have two commands which could use the same access pattern, or
> could use a different one, we may get a win by coercing them to the same
> access pattern but optimising it heavily. Its a little speculative, thus
> the 'may' in my prose.

That's much clearer.  Thanks.  Could you tweak the doc to talk about
common access patterns instead of "predictable manners"?

>>> +Stream all data required to recreate revs   branch (lightweight)
>>> +Stream file texts in topological order      bundle
>> ^^^ how is this different from branching?
> 
> branching does not require topological order - generating mp_diffs does.

I was more confused about why "bundle" was "file texts" and "branch" was
"all data required to recreate revs".

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGmAwm0F+nu1YWqI0RAst5AJ9ohv7rE+ZBjtX5EJyd/8lyckH3FwCeIxvE
W2rTdfXY19L3PJjDn0ohVms=
=5qst
-----END PGP SIGNATURE-----