[RFC] proposed user doc for nested trees

Wed May 13 10:46:12 BST 2009

2009/5/12 Aaron Bentley <aaron at aaronbentley.com>:

If this sounds grumpy or terse, I plead the poor excuse that I'm
trying to get it out before leaving for Allhands in the hope it will
help get any remaining issues cleared up and you unblocked to write
something cool.

> Martin Pool wrote:
>> I liked the document a lot, both as something that can eventually
>> become user documentation, and as a way to clarify the conversation
>> about what we are going to merge into bzr.
>
> I think it makes a lot more sense to start with the NestedTreesDesign
> document which I updated at your request, and look at user-level
> documentation afterward, as a supplement to it.

OK.  Most of these questions are things we do need to address in the
user documentation, in my opinion.  It's true it does serve two
purposes at the moment by both moving towards being actual
documentation and also being a framework for discussion.  But I think
it is worth aiming for it to be conceptually complete, so that we have
a clear picture and can aim for a clear design.

Also, there's value in seeing whether the underlying behaviour makes
sense when you try to explain it to a user: people like Tim or
Stephen are going to tend to fade out when you give them a document
talking about internal api changes but they would like to review the
user documentation.

>> I think this document needs a bit more detail about what's happening
>> behind the ui, and so does this conversation if it's to move forward
>> smoothly.
>
> Is that detail that is missing from NestedTreesDesign?

In some cases yes; specifics below, so some of the quotes are from there.

I'm going to raise a few specific questions about the design but I
feel what's really missing is a description of the principles from
which it descends.  If they were clear maybe many of these questions
would be obvious.  (And maybe we should still ask them just to make
sure the principles really are clear and self consistent, but at least
we'd probably know what the answers should be.)  You get this in some
dimensions, like saying up front that you'll recurse, but it's less
clear with regard to eg where the data is stored.  I think we do need
to make it clear otherwise this may be a mess, with us possibly
finding lots of bugs, or performance holes, or that the behaviour's
surprising or inconsistent.

For instance, NestedTreesDesign starts off by saying it's going to
describe by-reference Nested Trees, and not by-value nested trees, but
it doesn't define them.  Sure, I can guess what they mean by analogy
to programming languages or to memory but that's unreliable.  I know
you're not making up the terms for the first time now, but it seems to
me some of the annoying inefficiency on landing these changes is due
to people not being clear about what they propose or what they've
agreed to.

> The idea that a set of nested trees behaves like a single, larger tree seems relatively easy to grasp.

So it is, but to me it doesn't seem that's what people are mostly
asking for with this feature.  They normally don't expect to make
commits spanning trees or to change most of the subtrees at all; they
just want an easy way to assemble and reproduce all the dependencies.

In cases like uncommit, pull and merge it really won't, as far as I
can see, act like just one tree, so I doubt that's a good model
to aim for.

> In the initial implementation, recursion into subtrees should be implemented at the highest level possible.

Highest level of what?

> Speed of operations involving subtrees is not a major concern, but operations that do not use subtrees must not be observably slower.

I think, given what we've learned, it would be unwise to ship a
supported history format that doesn't think hard about performance.
It's moderately ok to ship something with the actual code being not as
fast as it could possibly be in the first cut, but the big-O factors
need to be reasonable because they're hard to shift later if they're
built into the user model or the data format.

> Locking a NestedTrees should lock all subtrees

This might be a reasonable implementation strategy, but I'm suspicious
- particularly for cases like dirstate which
(https://bugs.launchpad.net/bzr/+bug/98836 sucks) holds files open
while it's locked, and will limit us to perhaps a couple of hundred
subtrees, and some considerable time to open them.  (Whether more are
needed is maybe debatable.)

>> You cover it pretty well in some places by say what is or
>> isn't stored, but I think for users to really understand this they
>> need a model of what's recorded in the committed
>> inventories/revisions, what's in the working tree, and what's in the
>> branch/es.
>
> The changes to the storage are described at
> http://bazaar-vcs.org/NestedTreesDesign#data-storage

Sorry to be harsh but this doesn't actually define what data will be
stored.  It's a long way from being ready to code and I think that is
what's needed to make sure we're all comfortable with what's going to
be added.

I'd like more detail there.  For instance, if this is correct:

 references are stored in the inventory as a new inventory type
'subtree' that holds the parent directory and filename (collectively
defining its path), and the revision id that's present (or an
indication to use a tag or a branch head?) and the branch that's
present here (as ... what?)

How are they stored in the working tree?  In the inventory, or separately?
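
To make the question concrete, here is a rough sketch of the entry as
paraphrased above.  The class and field names are invented for
illustration and are not taken from NestedTreesDesign or bzrlib; the
"rev-123" value is made up, and the branch field is deliberately left
untyped because how it would be represented is exactly what's unclear.

```python
import posixpath
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SubtreeEntry:
    """Illustrative only: a 'subtree' inventory entry as paraphrased above."""
    parent_dir: str                    # path of the containing directory
    name: str                          # filename within parent_dir
    reference_revision: Optional[str]  # revision id present, if one is pinned
    branch: Any = None                 # representation is still an open question

    @property
    def path(self) -> str:
        # Parent directory and filename together define the path.
        return posixpath.join(self.parent_dir, self.name)


entry = SubtreeEntry("lib", "quxlib", "rev-123")
print(entry.path)
```

Whatever the real fields turn out to be, writing them down at this
level is the kind of detail I'm asking for.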

For other data format changes we've had, during the design phase, a
list of the important operations and some kind of indication of how
fast they should be.  That seemed to work well.  Maybe we should have
that here - for instance if you're going to fetch subtree graphs
during pull, perhaps you need to be able to see those graphs without
walking the whole inventory for every revision.

> Branches store a mapping of file-id to branch location and path.

Where? Is this redundant with what's in the inventory, because it
sounds like it.  Do location and path refer to its location within the
tree, or the name within the .bzr/branch/branches location, or the
parent url, or something else?

> Commands

I'd like to hear about how they're specifically affected, not just a
list.  For instance, it's not surprising that pull is affected, but
what's it going to do?  Pull, separately in all nested trees?  From
the same location as the containing tree was pulling from, or
somewhere else?

> reference-location

It's not clear what 'locations' means here.

> Branching from a nested tree.... "It then recurses into the quxlib directory, and does a branch for that."

This seems to imply it happens consequent to building the working
tree.  So if you don't build the working tree, it won't fetch the
nested branch data and you won't later be able to build the working
tree offline?

> "bzr queries baz.org/dev for the location associated with quxlib-id-"...

This is a separate query to the server done after building the tree?

>> Are they branched
>> from the branches in the source, or are they pulled from the same
>> reference URL they originally came from?  (The second seems
>> problematic if you're pushing to a server, because in general we don't
>> assume that the server can go and make outgoing connections on your
>> behalf.)
>>
>> What happens if you have a branch with no working tree?
>
> There is discussion of that in NestedTreesDesign.
>
>> Presumably
>> the fact that the nested trees were there is still present.
>
> Yes.
>
>> Is the
>> data copied correctly in this case?
>
> Yes.

These are the kind of questions that I'm asking not so much for
one-off answers in a mail thread but because they should be
documented.  I hope I'm not being thick or contrary; I think they are
reasonable questions people may be asking at this stage in the
document.

>>  Do they all go into the same
>> repository?
>
> If the containing tree's branch is part of a shared repo, yes.
> Otherwise, no.
>
>>  Will running 'bzr checkout' then reconstruct all the
>> nested trees? (Perhaps obviously yes.)
>
> Yes.
>
>> One question possibly out of scope for this design: some other systems
>> (like configmanager?) let you have the top level tree require creation
>> of ./a and ./a/b without ./a needing to know anything about it.
>
> This design assumes that if a contains b, then a depends on b.
>
>> The only case I've seen for that is when people have a top-level tree
>> which just does the assembly and nothing else.  It can probably be
>> delayed; it may be worth noting as a restriction.
>
> I have no idea how this design could be modified later so that the
> presence of b was specified by the top-level directory, not by a.  If
> that's an important feature, we should rethink this design now.

I don't think it's necessarily needed, but it's something people have
in other tools so we should say, for the sake of user review, if we're
not going to do it.

I see you reviewed some other tools, which is great, but not
configmanager and scmproj.  Maybe you should, or should write down
what you learned?

>
>>> bzr branch --nested
>>
>> I think "remember this branch in the parent" is more of an operation
>> in its own right than someting just done by 'bzr branch'.
>
> I was assuming branch --nested was essentially "bzr branch http://foo &&
>  bzr join --nested foo".

Me too, but Ian's document only talks up front about the branch
command, and I don't think it's enough by itself.

>
>
>> For example
>> you might want to init a new nested tree, or you might already have an
>> untracked nested branch constructed by some other means.  So why not
>> have 'bzr join' or 'bzr add --nested' (though the second seems now
>> discarded?)
>
> We also have bzr join --nested.  I don't consider add to be explicit enough.

OK; it makes sense that overloading add may be too much.

>
>> I'd like, more for the sake of this discussion though it would also
>> help users, to see how this would be shown by 'bzr status'.  If it
>> recurses
>
> http://bazaar-vcs.org/NestedTreesDesign#downwards-recursion specifies
> that status is recursive.
>
>> then it should show you the changes in the nested trees,
>> obviously, but it also seems to need to show that the version of the
>> nested tree in the parent is not what's in that branch.
>
> You mean, in cases where the last-revision in the nested tree has
> changed without the files changing?

I mean that if I do a commit (or pull) in the subtree, the subtree's
working directory basis and branch tip will be different to the
revision presumably recorded in the parent.
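
As a sketch of the check I have in mind, assuming status can get at
three revision ids for each nested tree: the one recorded in the
parent, the subtree's branch tip, and the subtree working tree's
basis.  The function name and the message wording are hypothetical,
not existing bzr behaviour.

```python
def subtree_status_notes(recorded_revision, branch_tip, basis_revision):
    """Return notes about one nested tree being out of step with its parent.

    Hypothetical helper: all three arguments are plain revision-id strings.
    """
    notes = []
    if branch_tip != recorded_revision:
        notes.append("branch tip differs from revision recorded in parent")
    if basis_revision != recorded_revision:
        notes.append("working tree basis differs from revision recorded in parent")
    return notes


# After a commit in the subtree, both its tip and its basis have moved on.
print(subtree_status_notes("rev-1", "rev-2", "rev-2"))
```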

>
>> Ideally we would have one consistent rule for all commands as regards
>> descent into nested trees.  If it's "they all recurse" that's great;
>
> It is: http://bazaar-vcs.org/NestedTreesDesign#downwards-recursion
>
>> Also, I think the code changes will be such that commands that aren't
>> explicitly updated won't recurse; that's probably the only sane
>> approach.  So that means that code in bzr that's not updated will
>> default to not recursing, and code in other places (like bzr-gtk or
>> qbzr) won't recurse either, at first.
>
> True.  The issue is that we want to take small steps.  Before we
> consider this feature 'beta', we will ensure consistent behaviour of
> core commands:
> http://bazaar-vcs.org/NestedTreesDesign#scope
>
> And we can add support for additional commands incrementally.
>
>> I think I'd rather be in the situation where they're all consistent
>> (and not recursing) but some are lacking a useful option.
>
> I would like to be consistent, but I think it's more important to decide
> the correct behaviour.  Once that's decided, we can figure out how to
> get to a consistent implementation of that behaviour.

Fair enough.

To me, not recursing seems a more conservative choice, and one without
any real downside except "people might prefer to recurse..."

>
>>> nested branch locations are not tracked over time
>>
>> I think we should say here where they are stored.  As non-versioned
>> data in the branch, like tags?
>
> Yes: http://bazaar-vcs.org/NestedTreesDesign#data-storage

Again I'd like more data here about just what's stored.

>
>>> bzr nested DIR LOCATION
>>
>> So this seems to me a lot like the issue of managing the push, pull,
>> etc default locations
>
> No, it's managing the reference locations, which are the locations,
> relative to the top-level branch, of the sub-branches.

OK; let's get that datum out of email into the documents.  So do we
store the pull location?  If not, how do the example commands of 'bzr
pull' know what to do?

>> It seems like for nested branches you want to control the
>> push, pull location for them too.
>
> Yes, but only in the normal way.

I guess, having read the design doc, you mean the normal way in that
they have real fully-formed branch objects hidden under .bzr/branches/
and they can have configuration etc.

This idea ought to be up front.

I do think, seeing this, that it seems like another case along with
looms where we're really wanting colocated branches or threads or
whatever, and perhaps we should think about the larger question a bit
before going into this.  Could we for example have just one file
describing all of the tip pointers (like 'refs' in git)?  Could having
multiple branches in there cause trouble for naive code that doesn't
read them, or for upgrade, or just for performance in needing to open
multiple medium-weight objects?
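
To illustrate the single-file idea, a minimal sketch assuming a
made-up format of one "name revision-id" pair per line.  This is not
an existing bzr format, nor git's actual refs layout, and it assumes
names contain no whitespace.

```python
def parse_tips(text):
    """Parse a hypothetical tips file into a {name: revision_id} dict."""
    tips = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, revision_id = line.split(None, 1)
        tips[name] = revision_id
    return tips


def format_tips(tips):
    """Serialise the dict back to text, sorted by name for stable output."""
    return "".join("%s %s\n" % (name, tips[name]) for name in sorted(tips))


text = "# tips\nquxlib rev-b\ntrunk rev-a\n"
print(parse_tips(text))
```

Reading every tip would then be one file open rather than one branch
open per nested tree.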

>
>> This section is actually raising a bit of a conceptual question for
>> me: are you saying that the nested branches have their own tip
>> pointer, or that they're really checkouts of branches held somewhere
>> else
>
> The working trees are checkouts, the branches are real branches:
> http://bazaar-vcs.org/NestedTreesDesign#sub-branches

>> When branching from a branch with tree references, bzr should create local branches.

This section is a bit confusing on first reading because I wasn't sure
what was rationale and what was the decision, and it's still not
completely clear what you mean to create: whole new .bzr
directories, or named directories with the contents of a branch?

>
>>> To delete the location of a nested branch: bzr nested --delete DIR
>>
>> The text seems to imply that does not delete the directory, just
>> forgets the location of its branch.  Are you then left with a checkout
>> with no branch, in which you can't do anything much until you
>> essentially rebind it?
>
> No.

So what does happen?

>
>> I'm most concerned that this will come in when people have related but
>> distinct branches that share file ids, eg if they both started by
>> branching from a common template.  Or you might plausibly have
>> libfoo1.1 and libfoo2.0 that share history.
>
> See
> http://bazaar-vcs.org/NestedTreesDesign#modelling-nested-trees-as-a-composite-tree

OK.  At any rate if the subtrees are identified by their root file id
that rules out having two copies of the same thing for the time being.

> get_reference_branch: Return branch for tree reference
> get_reference_info: Provide location and saved file path of tree reference branch
> set_reference_info

More detail please?

> Merger recurses downward by default.

Presumably it should fast-forward the subtrees if they're not diverged?
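
Roughly what I mean, sketched over a plain dict that maps each
revision id to its parent ids.  The function names and the three
outcome strings are invented for illustration; this is not bzrlib's
graph API or the Merger's actual logic.

```python
def is_ancestor(parents, candidate, revision):
    """True if candidate is revision itself or one of its ancestors."""
    seen = set()
    pending = [revision]
    while pending:
        current = pending.pop()
        if current == candidate:
            return True
        if current in seen:
            continue
        seen.add(current)
        pending.extend(parents.get(current, ()))
    return False


def subtree_merge_action(parents, this_tip, other_tip):
    """Decide what merging other_tip into a subtree at this_tip requires."""
    if is_ancestor(parents, other_tip, this_tip):
        return "up-to-date"      # nothing new on the other side
    if is_ancestor(parents, this_tip, other_tip):
        return "fast-forward"    # not diverged: just move the tip
    return "merge"               # diverged: a real merge is needed


parents = {"a": [], "b": ["a"], "c": ["a"]}
print(subtree_merge_action(parents, "a", "b"))
```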

-- 
Martin <http://launchpad.net/~mbp/>