Centralized storage in bzr

Mon Jun 13 15:57:28 BST 2005

Hi all,

Here are my thoughts on how we could support centralization in bzr
without losing most of the advantages of keeping everything in the
working directory.

So bzr works like Darcs, and in some ways, that's great.  Administration
is pretty simple, and the idea of "All my stuff is in this directory" is
pretty easy to grasp.

But there are also advantages to centralized storage: speed and size.
When pulling a branch, you only need to download stored entities that
aren't already in the centralized store.  So if I had a branch of bzr,
and you'd already downloaded Martin's branch, you'd only need to
download those revisions that I'd committed.  And of course, you only
need one copy of each stored entity, instead of duplicating them as we
currently do.

There's also the administration advantage that you only need to back up
one location that is unlikely to change.

So if optional centralized storage is a good idea, how do we hold on to
as many of the advantages of independent branches as we can?

1. a branch can be specified with one url
2. working trees can be moved around easily with mv
3. working trees can be seamlessly branched using cp
4. branches do not use filesystem properties that NTFS lacks

ONE URL
It's tempting to say "let's just centralize the stores, and keep
everything else where it already is."  If we did that, presumably we'd
keep some data about where to find the stores in the branch.  For remote
users, this can be a problem.  We'd need to convert the filesystem path
to a URL.  This could be difficult, or could be impossible, if the
central store was not remotely accessible.

I think it makes more sense for all branch data (but no working tree
data) to be stored centrally.  So you have:

$ANYPATH/stores/revisions
$ANYPATH/stores/inventories
$ANYPATH/stores/file-contents
$ANYPATH/branches/branch1
$ANYPATH/branches/branch2
$ANYPATH/branches/branch3

So users just have to ensure that $ANYPATH is remotely accessible, and
bzr clients just need to look at $ANYPATH/branches/branch3/../../stores
to get the storage.
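
As a rough sketch of that lookup (not anything bzr implements today),
the store location can be derived from the branch location alone.  The
function names and the example.com URL below are made up for
illustration, and the local variant assumes POSIX-style paths.

```python
import posixpath
from urllib.parse import urljoin


def store_url(branch_url: str, store: str) -> str:
    """Resolve <branch>/../../stores/<store> for a remote branch URL."""
    # A trailing slash makes urljoin treat the branch as a directory,
    # so each "../" climbs exactly one level.
    base = branch_url.rstrip("/") + "/"
    return urljoin(base, "../../stores/" + store)


def store_path(branch_path: str, store: str) -> str:
    """Same resolution for a local (POSIX-style) branch path."""
    return posixpath.normpath(
        posixpath.join(branch_path, "..", "..", "stores", store)
    )


# Hypothetical locations, for illustration only.
print(store_url("http://example.com/bzr/branches/branch3", "revisions"))
print(store_path("/srv/bzr/branches/branch3", "inventories"))
```

The point is that a client needs nothing beyond the branch URL: no
store location has to be recorded in the branch itself.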

It would be nice to have one URL both for local and remote use, but I
think it's okay for there to be one URL for remote use, and one pathname
for working trees.

MOVING AROUND WITH MV
To permit working trees to be moved around freely, I think we need to
introduce a branch-id.  This would be internal-use-only, strictly to
link the working tree to the central store.

When a working tree's directory is renamed, the branch name under
$ANYPATH/branches should change to match.  (But moving the working tree
without renaming it should not affect the remote pathname.)
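
To make the rename rule concrete, here is a minimal sketch.  It assumes
each directory under $ANYPATH/branches holds a small "branch-id" file;
that file name and both function names are hypothetical, not an
existing bzr layout.

```python
import os


def find_branch_dir(branches_root, branch_id):
    """Return the name of the central branch directory with this id."""
    for name in sorted(os.listdir(branches_root)):
        id_file = os.path.join(branches_root, name, "branch-id")
        if os.path.isfile(id_file):
            with open(id_file) as f:
                if f.read().strip() == branch_id:
                    return name
    return None


def sync_branch_name(branches_root, tree_path, branch_id):
    """Rename the central branch to match the working tree's basename."""
    # Only the last path component matters, so moving the tree to a
    # different parent directory leaves the central name untouched.
    wanted = os.path.basename(os.path.normpath(tree_path))
    current = find_branch_dir(branches_root, branch_id)
    if current is None:
        raise LookupError("no central branch with id %s" % branch_id)
    if current != wanted:
        os.rename(os.path.join(branches_root, current),
                  os.path.join(branches_root, wanted))
    return wanted
```

Because the lookup goes by branch-id rather than by name, the working
tree never needs to remember what the central directory was last
called.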

SEAMLESS BRANCHING USING CP
To permit working trees to be branched with cp, we need to arrange for
bzr to detect copies.  I think this can be handled by noting the
last-committed-revision-id in the working tree.  So:

~/bzr.dev has branch-id af-93-2e
We copy bzr.dev to bzr.dev2
We change bzr.dev2 and commit.
The $ANYPATH/branches/bzr.dev directory is renamed to
$ANYPATH/branches/bzr.dev2.
We change bzr.dev and commit.

Since the last-committed-revision-id doesn't match the revision history
of $ANYPATH/branches/bzr.dev2, a new branch-id is assigned to ~/bzr.dev,
and $ANYPATH/branches/bzr.dev is created.  We can use part of the
revision history from $ANYPATH/branches/bzr.dev2 to create it, or fall
back to scanning predecessor revisions, if necessary.
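
The sequence above can be sketched in memory as follows.  This is only
an illustration under stated assumptions: "doesn't match" is read as
"the tree's last-committed-revision-id is not the tip of the central
history", the central branches are modelled as a plain dict, and every
name here (CentralBranch, WorkingTree, commit) is hypothetical.  It
covers only the "use part of the revision history" case, not the
fallback of scanning predecessor revisions.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CentralBranch:
    branch_id: str
    history: list = field(default_factory=list)  # oldest first, non-empty


@dataclass
class WorkingTree:
    name: str            # basename of the working directory
    branch_id: str
    last_committed: str  # last-committed-revision-id


def commit(branches, tree, new_rev):
    """Record new_rev, renaming or forking the central branch as needed."""
    # Find the central branch by id, whatever it is currently called.
    name = next(n for n, b in branches.items()
                if b.branch_id == tree.branch_id)
    branch = branches[name]

    if branch.history[-1] != tree.last_committed:
        # Another tree has committed under this id, so this tree is a
        # copy.  Assumes last_committed is still in the shared history.
        cut = branch.history.index(tree.last_committed) + 1
        tree.branch_id = uuid.uuid4().hex
        branch = CentralBranch(tree.branch_id, branch.history[:cut])
        branches[tree.name] = branch
    elif name != tree.name:
        # Same branch under a new directory name: follow the rename.
        branches[tree.name] = branches.pop(name)

    branch.history.append(new_rev)
    tree.last_committed = new_rev


# Replaying the example: bzr.dev is copied to bzr.dev2, then both commit.
branches = {"bzr.dev": CentralBranch("af-93-2e", ["r1"])}
dev = WorkingTree("bzr.dev", "af-93-2e", "r1")
dev2 = WorkingTree("bzr.dev2", "af-93-2e", "r1")
commit(branches, dev2, "r2")  # central bzr.dev is renamed to bzr.dev2
commit(branches, dev, "r3")   # bzr.dev gets a new id and is recreated
```

After the second commit, bzr.dev2 keeps the original branch-id with
history r1, r2, while bzr.dev has a fresh id and history r1, r3.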

You'll note that this technique can cause $ANYPATH/branches/bzr.dev to
disappear temporarily.  I don't have a good solution for this.  Maybe we
can just detect that ~/bzr.dev still exists, and so treat ~/bzr.dev2 as a
new branch from the get-go.

BRANCHES DO NOT USE FILESYSTEM PROPERTIES THAT NTFS LACKS
I believe that nothing described here would be harder to implement under
Windows.  (And yeah, NTFS technically supports hard and symbolic links,
as well as long pathnames.  It's just that they aren't handled well by
Windows apps such as Explorer...)

So in sum, we:
1. store branch-id and last-committed-revision-id in the working tree
2. move all branch metadata into central storage

ATM, the only working tree data I can think of is the inventory and the
last-pulled-location.  Pending-merge data too, once it's implemented.

I haven't specified how we determine that a new branch should use
central storage, or detect where the central storage for a branch is, or
whether there can be multiple central locations.  I think we can decide
those issues separately.

Aaron