[RFC] proposed user doc for nested trees

Wed May 13 19:04:00 BST 2009

Well, I don't have any similar excuse.  If I sound grumpy, it's because
I am.

>> I think it makes a lot more sense to start with the NestedTreesDesign
>> document which I updated at your request, and look at user-level
>> documentation afterward, as a supplement to it.
> 
> OK.  Most of these questions are things we do need to address in the
> user documentation, in my opinion.  It's true it does serve two
> purposes at the moment by both moving towards being actual
> documentation and also being a framework for discussion.

I don't agree that it should serve as a framework for discussion.  I
don't think we can describe the UI in detail until we've come to grips
with the design.

> But I think
> it is worth aiming for it to be conceptually complete, so that we have
> a clear picture and can aim for a clear design.

I believe that a design document is useful, and I believe in DRY, so I
don't think that we should be sticking anything into the user doc that's
covered by the design document.  That way, if the design changes, we
won't have to update twice as many places.

> Also, there's value in seeing whether the underlying behaviour makes
> sense when you try to explain it to a user

There is, but first you have to decide whether you can realistically
offer that behaviour to users.

> someone like Tim or
> Stephen are going to tend to fade out when you give them a document
> talking about internal api changes but they would like to review the
> user documentation.

Tim did read the document, and he said everything seemed sensible to
him, except section 8, the API changes, which he didn't feel confident
to judge.

> I'm going to raise a few specific questions about the design but I
> feel what's really missing is a description of the principles from
> which it descends.  If they were clear maybe many of these questions
> would be obvious.  (And maybe we should still ask them just to make
> sure the principles really are clear and self consistent, but at least
> we'd probably know what the answers should be.)  You get this in some
> dimensions, like saying up front that you'll recurse, but it's less
> clear with regard to eg where the data is stored.

Okay.  Principles of data storage:
- No information that would be required to construct a nested tree may
  be stored solely in a WorkingTree.
- Information which may vary from site to site or among different
  branches on the same site must not be stored in versioned data.
- Repositories must store all the data needed to reconstruct all the
  revision trees they store.
- Repositories may not be stored at any location that could reasonably
  be deleted.
- Branches may not be stored at any location that could reasonably be
  moved or deleted.
- The same branch should not be used in multiple nested trees.
- It must be possible to reconstruct the composite tree shape for any
  commit.
- All data generated about a revision at commit time must be stored in a
  repository.

> I think we do need
> to make it clear otherwise this may be a mess, with us possibly
> finding lots of bugs, or performance holes, or that the behaviour's
> surprising or inconsistent.
> 
> For instance, NestedTreesDesign starts off by saying it's going to
> describe by-reference Nested Trees, and not by-value nested trees, but
> it doesn't define them.  Sure, I can guess what they mean by analogy
> to programming languages or to memory but that's unreliable.

I was trying to distinguish this document from NestedTreeSupport, which
described both types.

> I know
> you're not making up the terms for the first time now, but it seems to
> me some of the annoying inefficiency on landing these changes is due
> to people not being clear about what they propose or what they've
> agreed to.

I don't know what to say.  I thought I had made it clear that I was
going to finish up my original implementation of nested trees.  No one at
the sprint seemed to say "wait- what is that exactly?"  When I felt the
need to discuss its behaviour with end-users, I whipped up
NestedTreesDesign to get them up to speed.

>> The idea that a set of nested trees behaves like a single, larger tree seems relatively easy to grasp.
> 
> So it is, but to me it doesn't seem that's what people are mostly
> asking for with this feature.  They normally don't expect to make
> commits spanning trees or to change most of the subtrees at all; they
> just want an easy way to assemble and reproduce all the dependencies.

If that's really what they want, the approaches taken by config-manager
and scmproj should be sufficient.  I don't think there's a point in a
deeply-integrated NestedTrees system that only provides that level of
functionality.

As a tla & baz developer, I wanted more.  As a PanoMetrics developer, I
wanted more.  As a Launchpad developer, I still want more.  I want the
guarantee that when I refresh my source tree, via update or merge or
pull, my dependencies are up to date.  Otherwise, you can waste a lot of
time dealing with problems caused by out-of-date dependencies.

It's funny that Stephen describes keeping up to date with libneon,
because libneon, in particular, was a problem for tla & baz.  libneon
didn't maintain API stability, so when you merged from libarch, you had
to make sure you updated your version of libneon, too.

But I would say hackerlab was a much more common problem than libneon.

> In cases like uncommit, pull and merge it really won't, as far as I
> can see, just act like just one tree so I doubt if that's a good model
> to aim for.

I'm confident it can act like one tree with merge and pull. I'd never
have taken an interest in nested trees otherwise.

>> In the initial implementation, recursion into subtrees should be implemented at the highest level possible.
> 
> Highest level of what?

Highest-level operation, in the sense that cmd_commit.run is
higher-level than WorkingTree.commit which is higher-level than
Tree.iter_changes, which is higher-level than os.lstat.
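
As a rough illustration of recursing at the highest level, here is a
minimal Python sketch.  SketchTree and run_commit are hypothetical
stand-ins, not bzrlib classes, and committing subtrees before their
container is an assumption made for the example.

```python
class SketchTree:
    """Hypothetical stand-in for a working tree; not bzrlib's WorkingTree."""

    def __init__(self, name, subtrees=()):
        self.name = name
        self.subtrees = list(subtrees)

    def commit(self, message):
        # Single-tree commit: this layer knows nothing about nesting.
        return "%s: %s" % (self.name, message)


def run_commit(tree, message):
    """Command-level entry point: the only place that recurses."""
    results = []
    for subtree in tree.subtrees:
        # Assumption: subtrees are committed before their container.
        results.extend(run_commit(subtree, message))
    results.append(tree.commit(message))
    return results
```

The point of the layering is that SketchTree.commit stays unchanged for
trees without subtrees; only the command-level function walks the nesting.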

>> Speed of operations involving subtrees is not a major concern, but operations that do not use subtrees must not be observably slower.
> 
> I think, given what we've learned, it would be unwise to ship a
> supported history format that doesn't think hard about performance.

Okay.  I didn't mean to suggest that we would design a poor storage
format.  The main thrust was that we must not slow down non-subtree
operations.

> It's moderately ok to ship something with the actual code being not as
> fast as it could possibly be in the first cut, but the big-O factors
> need to be reasonable because they're hard to shift later if they're
> built into the user model or the data format.

I agree.

>> Locking a NestedTrees should lock all subtrees
> 
> This might be a reasonable implementation strategy, but I'm suspicious
> - particularly for cases like dirstate which
> (https://bugs.launchpad.net/bzr/+bug/98836 sucks)

But this is no worse than if it were one big tree and you took a lock at
the top level.  If 98836 hits, I think it's better that it hit
immediately, not e.g. part way through a commit.

> holds files open while it's locked, and will limit us to perhaps a couple of hundred
> subtrees

Even so, it's been established that dirstate trees don't *need* os
locks.  Someone should fix them.

> and some considerable time to open them.

For many operations, you'll need to open all trees anyhow.  It's just a
question of when you do it.
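
To make the "fail immediately, not part way through" argument concrete,
here is a small sketch of taking every lock up front.  All names
(SketchTree, LockFailed, lock_nested) are hypothetical and the locks are
simple flags, not bzrlib or OS locks.

```python
from contextlib import ExitStack, contextmanager


class LockFailed(Exception):
    """Raised when a (hypothetical) tree cannot be locked."""


class SketchTree:
    def __init__(self, name, subtrees=(), lockable=True):
        self.name = name
        self.subtrees = list(subtrees)
        self.lockable = lockable
        self.locked = False

    @contextmanager
    def lock_write(self):
        if not self.lockable:
            raise LockFailed(self.name)
        self.locked = True
        try:
            yield self
        finally:
            self.locked = False


def iter_trees(tree):
    """Yield the tree and all nested subtrees, depth first."""
    yield tree
    for subtree in tree.subtrees:
        yield from iter_trees(subtree)


@contextmanager
def lock_nested(tree):
    """Lock every tree before any work starts; release all on exit."""
    with ExitStack() as stack:
        for each in iter_trees(tree):
            # If any lock fails, ExitStack releases the ones already held.
            stack.enter_context(each.lock_write())
        yield tree
```

With this shape, a lock failure in any subtree surfaces before the body of
the with-block runs, so no operation is left half done.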

>> The changes to the storage are described at
>> http://bazaar-vcs.org/NestedTreesDesign#data-storage
> 
> Sorry to be harsh but this doesn't actually define what data will be
> stored.

It says that trees store tree-references, which have a subtree
revision-id, in addition to the data common to all inventory entries.

It says that branches store a mapping from file-id to branch location.

It says that repositories can store tree-references in their inventories.

To me, that's a perfectly adequate definition of what is stored.  What
do you feel is missing?

>  It's a long way from being ready to code and I think that is
> what's needed to make sure we're all comfortable with what's going to
> be added.

Unless you're asking for file format descriptions, I really don't get
what you're asking.

> I'd like more detail there.  For instance, if this is correct:
> 
>  references are stored in the inventory as a new inventory type
> 'subtree' that holds the parent directory and filename (collectively
> defining its path), and the revision id that's present (or an
> indication to use a tag or a branch head?) and the branch that's
> present here (as ... what?)
> 
> How are they stored in the working tree?  In the inventory, or separately?

It's not actually correct.  The inventory type is 'tree-reference', and
it's not new at all:

References are stored in the inventory using the 'tree-reference'
inventory type.  Like all inventory types, tree references have a name
and parent-id (collectively defining its path).  They also record the
revision-id of their subtree.  (They may also use the reserved
revision-id 'head:' to indicate the head revision of the subtree's
branch.)  The subtree revision-id is distinct from their own
last-modified revision.  No information about the branch itself is
stored here.
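
The data model just described could be sketched like this.  It is an
illustration only: TreeReferenceSketch and its field names are
hypothetical and are not bzrlib's inventory classes.

```python
from dataclasses import dataclass

# Reserved revision-id meaning "the head of the subtree's branch".
HEAD_REVISION = "head:"


@dataclass(frozen=True)
class TreeReferenceSketch:
    file_id: str
    name: str                 # with parent_id, defines the path
    parent_id: str
    revision: str             # last-modified revision of the entry itself
    reference_revision: str   # revision-id of the subtree, or HEAD_REVISION
    kind: str = "tree-reference"

    def tracks_head(self):
        """True if the entry follows the head of the subtree's branch."""
        return self.reference_revision == HEAD_REVISION
```

Note that there is no branch location field: that information belongs to
the branch, not to the inventory.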

> For other data format changes we've had, during the design phase, a
> list of the important operations and some kind of indication of how
> fast they should be.

That's completely reasonable for a new repository design, but I'm really
just talking about minor tweaks to the existing formats.

> That seemed to work well.  Maybe we should have
> that here - for instance if you're going to fetch subtree graphs
> during pull, perhaps you need to be able to see those graphs without
> walking the whole inventory for every revision.

I don't see how that relates to this design.

>> Branches store a mapping of file-id to branch location and path.
> 
> Where?

Somewhere in the control data maintained by that branch.  For
BzrBranchFormat8, this is .bzr/branch/references, but this is not a
discussion of file formats, and I don't think it should be.

> Is this redundant with what's in the inventory, because it
> sounds like it.

It's not.  The inventory stores no data about the branch itself, such as
its location, only the revision-id of the subtree.

> Do location and path refer to its location within the
> tree, or the name within the .bzr/branch/branches location, or the
> parent url, or something else?

They are the same kind of locations we use to describe branches
elsewhere.  If a relative location is used, it is interpreted relative
to the branch.
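
A minimal sketch of that interpretation rule, using Python's standard
urljoin on http-style URLs.  The function name and the example locations
are hypothetical; bzr's own URL handling is not shown here.

```python
from urllib.parse import urljoin


def resolve_reference_location(branch_base, reference_location):
    """Interpret a stored location relative to the containing branch.

    Absolute locations are returned unchanged; relative ones are joined
    onto the branch's own location.
    """
    # Treat the branch location as a directory so that joining works.
    base = branch_base if branch_base.endswith("/") else branch_base + "/"
    return urljoin(base, reference_location)


# Hypothetical mapping of file-id to branch location for one branch.
references = {
    "quxlib-id": "../quxlib",
    "other-id": "http://example.com/other",
}
```

For a branch at http://baz.org/dev, the first entry would resolve against
that location, while the second is already absolute and is used as is.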

>> Commands
> 
> I'd like to hear about how they're specifically affected, not just a
> list.  For instance, it's not surprising that pull is affected, but
> what's it going to do?

I don't understand.  Pull is not listed in that section, because this is
a list of new commands.  There is a description of what pull is meant to
do in http://bazaar-vcs.org/NestedTreesDesign#pull-and-non-initial-push

>> reference-location
> 
> It's not clear what 'locations' means here.

They are the locations of the branches associated with tree-references.

>> Branching from a nested tree.... "It then recurses into the quxlib directory, and does a branch for that."
> 
> This seems to imply it happens consequent to building the working
> tree.

It's not meant to imply that.  It reflects the current implementation,
which recursively creates branches at the subtree directories whether or
not it also creates working trees.

However, it is a bit out-of-date with current thinking.  Current
thinking is that those branches are stored in .bzr/branches, rather than
in user space.

>> "bzr queries baz.org/dev for the location associated with quxlib-id-"...
> 
> This is a separate query to the server done after building the tree?

Yes.

>>> The only case I've seen for that is when people have a top-level tree
>>> which just does the assembly and nothing else.  It can probably be
>>> delayed; it may be worth noting as a restriction.
>> I have no idea how this design could be modified later so that the
>> presence of b was specified by the top-level directory, not by a.  If
>> that's an important feature, we should rethink this design now.
> 
> I don't think it's necessarily needed, but it's something people have
> in other tools so we should say, for the sake of user review, if we're
> not going to do it.

Okay, we're not going to do it.

> I see you reviewed some other tools, which is great, but not
> configmanager and scmproj.  Maybe you should, or should write down
> what you learned?

Okay.

>>> I'd like, more for the sake of this discussion though it would also
>>> help users, to see how this would be shown by 'bzr status'.

It seems reasonable to expect that files *have* changed in the subtree,
so 'status' in the containing tree would show those changes.  I'm not
sure if anything more is needed initially.

Once status is aware of subtrees, it could show

 M  subtree+

(In case you've forgotten, '+' is the single-character indicator of
tree-references, just as '/' is used for directories and '@' is used for
symlinks.)
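
A tiny sketch of how such a line could be produced.  KIND_MARKERS and
status_line are hypothetical names, and this is not the actual status
code.

```python
# Single-character suffixes, as listed in the text above.
KIND_MARKERS = {
    "directory": "/",
    "symlink": "@",
    "tree-reference": "+",
}


def status_line(flags, path, kind):
    """Format one short-status line; files get no suffix."""
    return "%s %s%s" % (flags, path, KIND_MARKERS.get(kind, ""))
```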

>>> I think I'd rather be in the situation where they're all consistent
>>> (and not recursing) but some are lacking a useful option.
>> I would like to be consistent, but I think it's more important to decide
>> the correct behaviour.  Once that's decided, we can figure out how to
>> get to a consistent implementation of that behaviour.
> 
> Fair enough.
> 
> To me, not recursing seems a more conservative choice, and one without
> any real downside except "people might prefer to recurse..."

Sorry, I thought I had made the downside clear-- it makes it much easier
for users to get inconsistent trees and waste their time.

>>>> nested branch locations are not tracked over time
>>> I think we should say here where they are stored.  As non-versioned
>>> data in the branch, like tags?
>> Yes: http://bazaar-vcs.org/NestedTreesDesign#data-storage
> 
> Again I'd like more data here about just what's stored.

I can't think of anything more to say.  It says that we store a mapping
of tree-reference file-ids to branch locations.  It says that relative
branch locations are interpreted relative to the branch's location.  I
don't want to describe file formats, just data model, so there's nothing
else I can think of.

>>>> bzr nested DIR LOCATION
>>> So this seems to me a lot like the issue of managing the push, pull,
>>> etc default locations
>> No, it's managing the reference locations, which are the locations,
>> relative to the top-level branch, of the sub-branches.
> 
> OK; let's get that datum out of email into the documents.  So do we
> store the pull location?  If not, how do the example commands of 'bzr
> pull' know what to do?

We get a location for the top-level branch to pull from in the usual
way.  Either from the user or from a parent location.  We use the
top-level branch's mapping of file-ids to branch locations to determine
where to pull the sub-branches from.

>>> It seems like for nested branches you want to control the
>>> push, pull location for them too.
>> Yes, but only in the normal way.
> 
> I guess, having read the design doc, you mean the normal way in that
> they have real fully-formed branch objects hidden under .bzr/branches/
> and they can have configuration etc.

Right.  The only change I'm proposing to branches is the ability for
them to map file-ids to branch locations.

> This idea ought to be up front.
> 
> I do think, seeing this, that it seems like another case along with
> looms where we're really wanting colocated branches or threads or
> whatever, and perhaps we should think about the larger question a bit
> before going in to this.

I think that normal branches will work, and I don't want to borrow trouble.

> Could we for example have just one file
> describing all of the tip pointers (like 'refs' in git)?

They wouldn't be proper branches, then.  They would lack configuration
data such as parent location.

> Could having
> multiple branches in there cause trouble for naive code that doesn't
> read them

I don't think it could cause trouble.

I think it is highly unlikely that any naive code forbids the existence
of a .bzr/branches directory.  Any code that does discover these
branches can handle them as normal treeless branches.  Any code which
skips .bzr directories when searching for branches will skip them, too.

I think the biggest risk would be a naive garbage collector.

However, we can rev the metadir format if you're concerned.

> or for upgrade

I guess it's possible that upgrade would fail to copy them into
backup.bzr.  I'll look into that.

> or just for performance in needing to open
> multiple medium-weight objects?

I think this is the sort of performance issue that we can improve later
if it turns out to be a problem.

>>> When branching from a branch with tree references, bzr should create local branches.
> 
> This section is a bit confusing on first reading because I wasn't sure
> what was rationale and what was the decision, and it's still not
> completely clear that you mean to put what, whole new .bzr
> directories, or named directories with the contents of a branch?

".bzr/branches/foo" would be a treeless, but otherwise completely normal
branch.  There would be a .bzr/branches/foo/.bzr/branch/format file, for
example.

>>>> To delete the location of a nested branch: bzr nested --delete DIR
>>> Are you then left with a checkout with no branch?
>> No.
> 
> So what does happen?

It's Ian's idea, but I would imagine it deletes the entry from the
containing branch's mapping of file-ids to branch locations.  I don't
think it would also mutate the checkout's .bzr/branch/location file.

>> get_reference_branch: Return branch for tree reference
>> get_reference_info: Provide location and saved file path of tree reference branch
>> set_reference_info
> 
> More detail please?

get_reference_branch:
accept file-id and path parameters, both unicode strings.  return a
Branch implementation corresponding with the input.  On BzrBranch7, this
derives the location of the branch by joining the branch's base with the
supplied path.  On BzrBranch8, this derives the location of the branch
from the branch's mapping of file-ids to branch locations.

get_reference_info: accept a unicode string for a file-id.  Use the
branch's mapping of file-ids to branch location and path to determine
the location and path.  Return the location and path as a tuple of
unicode strings.

set_reference_info: accept a file-id, path and branch location.  Add the
supplied data to the branch's mapping of file-id to path and branch
location.  Return nothing.  As a special case, if None is supplied as
both path and branch location, delete the entry.
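
The get/set pair described above can be sketched as follows.  This is an
in-memory illustration, not the BzrBranch8 implementation; the class
name, the (path, location) tuple order and the (None, None) result for a
missing entry are assumptions.

```python
class ReferenceMapSketch:
    """Hypothetical holder for a branch's file-id to reference mapping."""

    def __init__(self):
        self._info = {}

    def set_reference_info(self, file_id, path, branch_location):
        """Record or delete the entry for file_id.  Returns nothing."""
        if path is None and branch_location is None:
            # Special case: None for both values deletes the entry.
            self._info.pop(file_id, None)
            return
        self._info[file_id] = (path, branch_location)

    def get_reference_info(self, file_id):
        """Return (path, branch_location) for file_id."""
        # Assumption: an unknown file-id yields (None, None).
        return self._info.get(file_id, (None, None))
```

A typical round trip would set an entry for a file-id, read it back, and
then remove it by passing None for both path and location.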

>> Merger recurses downward by default.
> 
> Presumably it should fast-forward the subtrees if they're not diverged?

I wasn't planning on that.  I think you would only want that if you were
comfortable with "merge --pull" behavior in general.  Maybe it would be
okay if we only did a fast-forward when the target's tip was in the
lefthand ancestry of the source's tip.
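
That condition could be checked roughly like this.  The revision graph is
a plain dict from revision-id to a tuple of parent ids, leftmost parent
first; the function names are hypothetical and this is not bzrlib's graph
API.

```python
def lefthand_ancestry(tip, parents):
    """Yield tip, then its first parent, and so on back to the origin."""
    revision = tip
    while revision is not None:
        yield revision
        revision_parents = parents.get(revision, ())
        revision = revision_parents[0] if revision_parents else None


def can_fast_forward(target_tip, source_tip, parents):
    """True if target_tip lies on the lefthand ancestry of source_tip."""
    return target_tip in lefthand_ancestry(source_tip, parents)
```

Under this rule a target whose tip was only merged into the source (a
righthand parent) would get a real merge rather than a fast-forward.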

Aaron