Naive questions re hard-linking repositories

Wed Apr 15 07:22:14 BST 2009

2009/4/15 Ian Clatworthy <ian.clatworthy at canonical.com>:
> Given it takes ~ 4 minutes to branch Emacs outside a shared repo
> and 6 seconds to branch within one, I'd like to better understand
> why we don't just hard link the .bzr/repository directory when
> conditions permit it, e.g. both source and target branch are local
> and on the same filesystem say.

(Ian confirms on IRC that the 6s includes the time to build the tree,
so the multiplier is even larger than you might think; just making a
branch with no tree should be ~0s.)

Well, you can't hardlink the directory because the OS won't let you.
:-) <http://en.wikipedia.org/wiki/Hard_link#Limitations_of_hard_links>
I presume this is so that the directory tree remains a tree rather than
a directed, possibly cyclic graph.  As trivia, some Solaris versions
would let root hardlink directories, but this could cause a kernel
panic.
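
As a quick illustration of that limit, here is a minimal sketch that
asks the OS to hard-link a directory and reports what happens.  The
function name and the directory names are made up for the example; on
typical Linux and macOS filesystems I'd expect the call to be refused,
but the exact error can vary by platform.

```python
import os
import tempfile


def try_hardlink_directory():
    """Attempt to hard-link a directory; return the error raised, or None."""
    with tempfile.TemporaryDirectory() as base:
        # Hypothetical directory name, standing in for .bzr/repository.
        source = os.path.join(base, "repository")
        os.mkdir(source)
        try:
            os.link(source, os.path.join(base, "repository-link"))
        except OSError as exc:
            # The OS refused to create a second name for the directory.
            return exc
        return None


if __name__ == "__main__":
    print(try_hardlink_directory())
```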

However, we could plausibly hardlink the pack files within it.

I think we need to look at this at several levels (in descending order):

1- how does "I want a new branch and working area" map into the bzr
model, and in particular does it create a new repository and copy the
data, or make a stacked branch, or something else?
2- if you are copying all (or most) of a repository's content locally,
should you walk the whole graph and transfer the data semantically, or
should you just copy the repository's packed-up form similar to cp -r?
3- if you're copying the repository just as a bunch of files, should
you in fact make hard links rather than copying them?

The top level possibly gives the biggest win, but the bottom ones are
arguably easier to change and closer to the topic of your mail.

I think it would be reasonable to have local branch just hardlink all
the pack files and make a new repository.  We would still want an
option, maybe --precise, that walks over the graph, validates it and
copies only what's strictly needed, but that need not be the default.
If they can't be hardlinked (e.g. because of a filesystem limit) then
you could just copy them.  So the lower bound on the time for these
should be the time to do 'cp -r' or 'cp -rl' of the repository
directories, plus building the tree.
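
To make the link-with-copy-fallback idea concrete, here is a rough
sketch in plain Python.  It is not bzrlib code: link_or_copy_tree is a
hypothetical helper, and it assumes the files being linked (the pack
files) are never modified in place after they are written.  Anything in
the repository that is rewritten in place would need a real copy
instead.

```python
import os
import shutil


def link_or_copy_tree(src, dst):
    """Recreate the tree under src at dst, hard-linking files where possible.

    Falls back to a plain copy for any file that cannot be linked, for
    example when src and dst are on different filesystems.
    Returns a (linked, copied) pair of file counts.
    """
    linked = copied = 0
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            source = os.path.join(dirpath, name)
            target = os.path.join(target_dir, name)
            try:
                # Roughly what 'cp -rl' does for each file.
                os.link(source, target)
                linked += 1
            except OSError:
                # Roughly what 'cp -r' does for each file.
                shutil.copy2(source, target)
                copied += 1
    return linked, copied
```

The returned counts make it easy to see which path was taken; a real
implementation would also have to decide what to do if the target
already exists.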

This has some disadvantages compared to having a shared repository,
because they're only sharing storage at one point in time: once they
start to diverge or if one of them is repacked, they'll start using
more disk space.  Still, it will have saved space at that one
particular point in time, and future access should be no slower than
it would be otherwise.
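
The point-in-time nature of the sharing can be seen with a small
sketch.  The "repack" here is only simulated, on the assumption that a
repack writes a new file and renames it into place rather than
rewriting the old one; the function and file names are invented for
the example, and it needs a filesystem that supports hard links.

```python
import os


def sharing_before_and_after_repack(workdir):
    """Return (shared_before, shared_after) for a simulated repack."""
    original = os.path.join(workdir, "a.pack")
    linked = os.path.join(workdir, "b.pack")
    with open(original, "wb") as f:
        f.write(b"history")

    # Two names, one file on disk.
    os.link(original, linked)
    shared_before = os.path.samefile(original, linked)

    # Simulated repack of the second repository: write a new file and
    # rename it over the old name.  The old data is no longer shared.
    tmp = linked + ".tmp"
    with open(tmp, "wb") as f:
        f.write(b"history, repacked")
    os.replace(tmp, linked)
    shared_after = os.path.samefile(original, linked)

    return shared_before, shared_after
```

The first repository's file is untouched by this, which is why
hard-linking write-once files is safe, but from then on the two
repositories each pay for their own storage.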

It would take a little care to do this in a clean way, and it would
mean there's another code path by which data is copied between
repositories, and therefore a need for more testing and the possibility
of different bugs, but the improvement is potentially quite large.
Unless someone else sees a problem you could try to do a patch for it.

> More broadly, I guess I'm asking us to revisit our assumptions about
> what branch must do vs what it does now. Shared repositories are
> cool but we ought to have a system that benefits from them, not
> *requires* them for acceptable performance. I don't have the answers,
> or even all the questions, so I thought I'd start back at the basics ...

So up at level #1, the question is why does that branch command even
need to think about copying all the data when the user story is "I
want a new logical branch and working tree."

Early versions of bzr took the approach that a directory holds a
working tree, a branch pointer, and a repository with the history of
that branch.  This concept is still in bzr's DNA and the defaults are
oriented towards it.  However, the actual recommended mode at the
moment is: make a shared repository, then make branch directories in
there, typically all with working trees.

Whether you prefer one or the other, the fact that bzr is still
oriented towards a mode of operation that isn't what we generally
recommend is a problem.  I think this lies behind complaints that
there are too many ways to use bzr.  Having the options is not a
problem so much as the lack of a clear normal method on which both the
community and the tool agree.

Another reflection of this: if the developers agreed with bzr's
defaults that the normal thing is to copy the repository when you
branch, we certainly would have made the kind of change you suggest
much earlier.

I think a case where having some flexibility does make sense is that
some people with large trees or slow build processes may prefer just
one tree that switches around, whereas others who have many streams
of work in progress and modestly sized trees might like lots of checkouts.
Supporting both is great but it should be clear how you get from one
to the other.

And this basically leads in to
<http://bazaar-vcs.org/DraftSpecs/EasyWorkspaceSetup>.

-- 
Martin <http://launchpad.net/~mbp/>