Feedback from evaluation in a corporate environment

Stephen J. Turnbull stephen at xemacs.org
Thu Jan 7 19:59:57 GMT 2010


Uri Moszkowicz writes:

 > Subversion due to the large repository I'm guessing. I'm posting to the
 > Bazaar mailing list because of all the modern tools, I think Bazaar is the
 > closest to supporting an environment such as this.

I think you vastly underestimate the strength of the git architecture.
Time and time again I've seen the git community imitate a feature
introduced by another project by cobbling together a few scripts.  If
they acquire a following, the implementation gets cleaned up and added
to the core.  (This is an important factor in git's notoriously
baroque UI.)

In particular, the basic git object store is append-only, which would
seem to offer appropriate guarantees of many of the features you need.
Since object names are universal, git can also consult a list of
object repositories.  For example, one or more servers could offer
read-only access via NFS to the 10GB object store (read-only access
sidesteps NFS locking, which is normally a horror show), while local
commits, and commits received from private branches, would be stored
privately on the local disk or even a private NFS mount.  Mercurial's
object storage is also append-only, but it's not as flexible as git's
or Bazaar's.
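
To make the layering concrete, here is a minimal sketch in Python.  It
is a toy model, not git's real on-disk format: the class ObjectStore
and its methods are hypothetical names, and objects are named by a
plain SHA-1 of their content.  It only illustrates why a private
writable store can sit in front of a shared read-only one when names
are universal.

```python
import hashlib


class ObjectStore:
    """Toy append-only, content-addressed store (hypothetical, not git's format)."""

    def __init__(self, alternates=()):
        self._objects = {}
        self._alternates = list(alternates)  # other stores consulted on a miss
        self._frozen = False

    def freeze(self):
        """Mark the store read-only, like a read-only NFS export."""
        self._frozen = True

    def put(self, data: bytes) -> str:
        if self._frozen:
            raise PermissionError("store is read-only")
        name = hashlib.sha1(data).hexdigest()
        # Append-only: an existing object is never overwritten or removed.
        self._objects.setdefault(name, data)
        return name

    def has_local(self, name: str) -> bool:
        return name in self._objects

    def get(self, name: str) -> bytes:
        if name in self._objects:
            return self._objects[name]
        for alt in self._alternates:
            try:
                return alt.get(name)
            except KeyError:
                continue
        raise KeyError(name)


# The big shared history lives in one store, exported read-only.
shared = ObjectStore()
old = shared.put(b"ten gigabytes of history")
shared.freeze()

# Each developer writes only to a small private store layered on top.
local = ObjectStore(alternates=[shared])
new = local.put(b"my local commit")
```

Because the same content always gets the same name, the local store
never needs to write to the shared one, and no locking on the shared
store is involved.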

I don't mean to push git on the Bazaar list, but git is the only one
of the three whose object storage layout and semantics I understand
well.  Hopefully somebody who knows Bazaar better than I do can point
out features of Bazaar that achieve the effects you need.

 >   * Checkouts don't behave like branches

Why would you expect them to?  CVS checkouts don't behave like
branches, SVN checkouts don't behave like branches, and git and
Mercurial checkouts don't even bother to exist, let alone behave like
branches.  The whole philosophy of checkouts is that they are *not*
branches.  Is this just a reference to your request for "lightweight
branches"?

 > For distributed read/write support (master-master), Bazaar needs something
 > like a distributed commit transaction. Bind sort of fits this role but not
 > quite yet. First, you want for repositories to be able to be bound to
 > multiple other repositories, not just one.

OK.

 > Second, not only do you want the commit to succeed in the parent
 > repository before the child but you want the commit to succeed in
 > all the repositories or none of them.

This is a noop.  A commit that succeeds in one repository will succeed
in *all* related repositories once it gets there ("related" meaning
that the parent commit(s) is (are) available in all repositories).

All of the modern DVCSes are based on a *DAG* of commits, with new
heads being created automatically when different commits have the same
parent.  This is completely unlike the centralized VCS (CVCS) model,
where creating a new head requires an explicit branch (or tag -b in
CVS).
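
A minimal sketch of that point, assuming commits are represented as a
plain mapping from commit id to parent ids (the ids and the function
name heads are made up for illustration).  A head is simply a commit
that no other commit names as a parent, so two commits on the same
parent yield two heads with no explicit branch operation.

```python
def heads(parents):
    """Return the commits that are not a parent of any other commit.

    parents maps a commit id to the list of its parent ids.
    """
    referenced = {p for ps in parents.values() for p in ps}
    return sorted(c for c in parents if c not in referenced)


history = {"A": [], "B": ["A"]}
one_head = heads(history)

# Two developers commit independently on top of B.
history["C"] = ["B"]
history["D"] = ["B"]
two_heads = heads(history)

# A merge commit with both as parents brings it back to one head.
history["E"] = ["C", "D"]
merged = heads(history)
```

Neither C nor D had to "fail": both are valid additions to the DAG,
and the merge E is just another commit.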

What may conflict is the update of the branch reference, the *name* of
the head of a sequence of commits.  In a CVCS, there is no place to
put the commit unless an explicit branch is done, so you need locks
and atomicity guarantees for the commit as well as the branch update.
This is not true in a DVCS.  Different DVCSes have different
strategies here.  Mercurial simply creates a nameless head (actually
it has the name "tip" but this is conceptually a tag, not a branch),
and eventually you will be forced to merge it or otherwise handle it
when you try to communicate with other repositories (because that's
when you need coherency in names of heads).  git goes this one better
with reflogs (but has similar restrictions on communication, which
manifest somewhat differently, typically as a "rebase disaster").  bzr is
basically like Mercurial, except that IMO it nags you to merge
somewhat earlier.
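
The distinction between the commit and the branch name can be sketched
as a compare-and-swap on the name.  This is an illustration only, not
a claim about how git, Mercurial or bzr implement it; the class
Branches and the exception RefConflict are hypothetical.

```python
class RefConflict(Exception):
    """The branch name moved since the caller last looked at it."""


class Branches:
    """Toy table of branch names pointing at commit ids."""

    def __init__(self):
        self.refs = {}

    def update(self, name, expected_old, new):
        current = self.refs.get(name)
        if current != expected_old:
            raise RefConflict(f"{name} is at {current}, not {expected_old}")
        self.refs[name] = new


branches = Branches()
branches.update("trunk", None, "B")

# Developer 1 moves trunk from B to C.
branches.update("trunk", "B", "C")

# Developer 2 still thinks trunk is at B.  The commit D exists and is
# perfectly valid; only the update of the name is refused.
try:
    branches.update("trunk", "B", "D")
    conflicted = False
except RefConflict:
    conflicted = True
```

Only the name needs this kind of check; the commits themselves can be
stored anywhere, at any time, without coordination.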

All of the DVCSes have support for version-controlled multiple heads
in a certain sense (native git branches, Stacked Git, Mercurial 
queues, Bazaar pipelines and looms).  Of course all but native git
branches are more or less restricted, but they are still very useful
in one's personal workspaces, and AIUI both looms and pipelines can be
communicated to other repositories/branches.

 > Third, you'll need some sort of deadlock/livelock avoidance
 > mechanism.

I don't think this is true, any more than you need a deadlock/
livelock avoidance mechanism in the DNS.  The alternative, as with
DNS, is to have downstream systems that are somewhat robust to
propagation lags.  I may not understand your requirement for
synchronicity well enough, of course.

 > I think distributed read/write may be too difficult to support in
 > the near term.

*Distributed* read/write is what DVCS is all about.  You are asking
for system-wide guarantees of *synchronous* read/write, meaning all the
repositories are sync'ed.  I agree that current DVCSes are not able to
handle this; I question your implicit assumption that your *development
management* systems are capable of doing their part given an advanced
DVCS that can do what you specify.  OTOH, I think that current DVCSes,
properly used, probably can keep up with the development process you
actually use.

 > There are some features that are needed just for the distributed read
 > support (master-slave) which are also needed for distributed read/write.
 > Repositories need to be chained together for this to work. Suppose
 > you have a master server and a read proxy which is bound to the master. In
 > the DVCS world, you either clone the proxy or check it out. Either way,
 > commit/push doesn't result in updating the master.

Assuming by "proxy bound to master" you mean "bzr bind", I believe
you're misunderstanding.  True, currently I don't think you can
recursively bind to (or checkout from) a branch which is bound to yet
another branch.  However, if you clone the proxy, AIUI a local commit
in the clone does not update the proxy or the master, but a push to
the proxy will update not only the proxy but also the bound master.

 > After every commit, the master then needs to update all of the
 > proxies, which, to avoid a loop, shouldn't result in an update back
 > to the master.

No, the loop will terminate immediately because the master will say
"I've already got that commit", neither the master nor the proxy will
change, and the update protocol terminates.  So the feature you
propose is an optimization, but doesn't really change behavior.
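
A minimal sketch of why the loop terminates, assuming each repository
is just a set of commit ids and forwards anything new to its peers
(the class Repo and its fields are hypothetical, not bzr's protocol).

```python
class Repo:
    """Toy repository: a set of commit ids plus peers to forward to."""

    def __init__(self, name):
        self.name = name
        self.commits = set()
        self.peers = []
        self.stored = 0  # how many times something was actually added

    def receive(self, commit):
        if commit in self.commits:
            return  # "I've already got that commit": nothing changes
        self.commits.add(commit)
        self.stored += 1
        for peer in self.peers:
            peer.receive(commit)


master = Repo("master")
proxy1 = Repo("proxy1")
proxy2 = Repo("proxy2")

# The master updates every proxy, and every proxy updates the master,
# so the topology contains loops.
master.peers = [proxy1, proxy2]
proxy1.peers = [master]
proxy2.peers = [master]

# A developer pushes a commit to proxy1.
proxy1.receive("c1")
```

Each repository stores the commit exactly once, and the propagation
stops as soon as every peer answers that it already has it.
Suppressing the echo back to the master saves one round trip, nothing
more.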

 > It would be nice if the distribution to proxies were non-blocking
 > as well.

It had better be; that's what DVCS means.  Unless I don't understand
what you mean by "blocking" here.

 > With distributed repositories, it is possible that on occasion some
 > or all of the servers will become disconnected and so there needs
 > to be some mechanism for resyncing on reconnection. The options I
 > see here are for the repositories to update on reconnection, a
 > periodic resync, or a check for coherency on checkout/branch. I
 > don't think the first option is compatible with the DVCS philosophy

Well, your whole set of requirements is incompatible with DVCS
philosophy (but Bazaar intends to be more than "just DVCS", so that in
itself is no problem).  However, given those requirements, update on
reconnection seems like the obvious solution.  A coherency check on
checkout/branch is insufficient; you'd really need a coherency check
on every update, so I don't think that idea will work.

 > Support for checkouts is great for scenarios where you don't expect to need
 > to commit to a local repository but there's still one feature missing:
 > lightweight branches. Cloning the whole repository takes way too long on a
 > large repository and consumes expensive disk space.

Mercurial may still require this (its support for "named branches" was
way inadequate as of v1.1, but they're up to 1.4 now so that may have
changed).  But both git and Bazaar allow many branches to share
repositories; cloning a large repository only needs to happen once per
project per developer (at most).

Of course you still may want to avoid that, but I think you are way
overestimating the costs.

 > Developers on these large projects, perhaps even more so than OS
 > projects, want support for private branches and checkouts don't get
 > you that. There should be a path to convert from a lightweight
 > checkout to a lightweight branch, the distinction primarily being
 > where the commits go. GIT seems to have gone the route of shallow
 > clones, where some specified subset of the repository is cloned
 > with paralyzing restrictions.

I think you misunderstand the git development process.  git has
developed that particular feature with those particular restrictions
because it was easy to do, and sufficient for the existing use cases.
It's quite possible that additional features you need could be added
with nominal effort.

The fact that git needed to impose those restrictions should give you
pause, too.  It seems that the distinction between "lightweight
checkout" and "lightweight branch" is more subtle than just "where the
commits go".  git has no trouble with committing to a shallow repo.

More important, as John pointed out, bzr has the concept of stacked
branches, which seem to do exactly what you want.

 > Last, the only tool that I've found that can robustly read a CVS repository
 > is cvs2svn. It has support for Bazaar now but I ran into some problems with
 > it. I was able to convert trunk with history fairly easily into Bazaar but
 > when I told it to include all the branches and tags (it doesn't even support
 > specification of specific ones, you get all or nothing) I killed the process
 > after >1 day of running and an 80GB fast import file, which didn't even
 > appear to be remotely near completion.

In this day and age of terabyte disks, I don't understand why a
one-time cost of 80GB, or even 800GB, sets you back that way.

However, it might be an interesting idea to convert content, not to
fastimport format, but to git object (compressed) format or even git
packs.
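
For a sense of what that would involve, here is a sketch of git's
loose object encoding for a single file's content (a "blob"): a short
header, a NUL byte, the content, all zlib-compressed and named by the
SHA-1 of the uncompressed bytes.  The function names are made up, and
this covers blobs only; trees, commits and packs are more involved
and are not shown.

```python
import hashlib
import zlib


def encode_blob(content: bytes):
    """Return (object name, compressed bytes) for a git blob."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    name = hashlib.sha1(store).hexdigest()
    return name, zlib.compress(store)


def loose_path(name: str) -> str:
    """Path of a loose object relative to the .git directory."""
    return f"objects/{name[:2]}/{name[2:]}"


def decode_blob(compressed: bytes) -> bytes:
    """Inverse of encode_blob, with a sanity check on the header."""
    store = zlib.decompress(compressed)
    header, _, content = store.partition(b"\0")
    kind, size = header.split(b" ")
    if kind != b"blob" or int(size) != len(content):
        raise ValueError("not a well-formed blob")
    return content


# Identical file contents on many CVS branches and tags map to one
# object name, so they would be stored once rather than repeated.
text = b"int main(void) { return 0; }\n" * 1000
name, packed = encode_blob(text)
same_name, _ = encode_blob(text)
```

Whether this actually beats a fast-import stream for a given CVS
repository would have to be measured; the sketch only shows that the
format is simple enough to target directly.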



