Feedback from evaluation in a corporate environment

Thu Jan 7 16:12:48 GMT 2010

Hi,
I evaluated Bazaar for use in a corporate environment and just wanted to
provide you all with some (lengthy) feedback with suggestions on what would
be needed for projects like ours. First a little bit about our environment.

  * ~50 developers geographically distributed
  * Using CVS, with one repository per site all kept in sync using a custom
setup
  * ~10GB repository (few binaries but 15 years of history)
  * ~4GB for full checkout, ~500MB for checkout of just code needed for
regular development
  * Need to delete files with history on occasion (obliterate)
  * Need to colocate all files in one repository so that they can be
versioned together (which test runs with which version of software)
  * Sandboxes always checked out to NFS servers for easy sharing,
backup/snapshots, and performance under load. Implies that disk space is
expensive.
  * Sandboxes typically split up on multiple NFS servers to isolate unique
backup/snapshot and load requirements

I think environments like this are common in the corporate world and
unfortunately the only relatively modern open source tool that seems to work
in such an environment today is Subversion. Perforce also seems to be
popular in such setups. There are a few OS projects which also seem to have
a comparably large repository. KDE is one example, which still mostly uses
Subversion, I'm guessing because of its large repository. I'm posting to the
Bazaar mailing list because of all the modern tools, I think Bazaar is the
closest to supporting an environment such as this.

A common theme among DVCS tools seems to be cloning the whole repository,
which in cases like this would be a disaster. This architecture does not
scale well at all to larger projects. Bazaar stands out here with
lightweight checkouts but still falls short in other areas:

  * No support for partial checkouts
  * No support for read (or ideally read/write) proxies
  * Checkouts don't behave like branches
  * No project-specific hooks/plugins
  * Poor support for CVS migration

Browsing through some Bazaar history, it looks like I'm not the first one to
mention the need for supporting partial checkouts. It looked like there were
some discussions around whether developers should be testing with only a
partial repository but as has been mentioned before and as I hinted above,
there are times when not every file in the repository is needed for regular
development but you still want the files to be colocated. Checking out the
whole repository and hiding all the parts you didn't ask for won't cut it I
think.

Subversion provides native support for setting up read proxies. CVS and
Subversion provide read/write support through WANDisco, though it costs
money, as does Mercurial through the autosync plugin. I've been told before
that this is contrary to the philosophy of DVCS tools but I don't think so.
Even users of DVCS tools eventually have a single integration branch. The
need for setting up proxies is not to change that but simply to speed up the
operations in a distributed environment. Bazaar already has pieces of
support here that just need to be slightly extended.

For distributed read/write support (master-master), Bazaar needs something
like a distributed commit transaction. Bind sort of fits this role but not
quite yet. First, you want repositories to be able to be bound to
multiple other repositories, not just one. Second, not only do you want the
commit to succeed in the parent repository before the child but you want the
commit to succeed in all the repositories or none of them. Third, you'll
need some sort of deadlock/livelock avoidance mechanism. It could be
locking, agreement by all parties on conflict resolution, or something as
simple as the ALOHA protocol (wait a random amount of time and try again). I
think locking is the safest way. The Mercurial autosync approach of sending
an email on conflicts won't scale well. Last, you need fault tolerance. In
any distributed scenario servers will go up and down and you want people to
be able to continue to work. As it appears many of you use IRC, you should
already be familiar with this problem except in this case automatic
resolution upon reconnection is more complicated. I think distributed
read/write may be too difficult to support in the near term.
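
To make the second and third requirements concrete, here is a minimal
in-memory sketch of an all-or-nothing commit with locking and a random
wait on contention. Repo, try_lock, unlock and commit_everywhere are
hypothetical names for illustration, not Bazaar APIs, and the sketch does
not cover fault tolerance (a server dying halfway through).

```python
import random
import time


class Repo:
    """Hypothetical stand-in for one repository a branch is bound to."""

    def __init__(self, name):
        self.name = name
        self.locked_by = None
        self.revisions = []

    def try_lock(self, owner):
        # Grant the lock only if nobody else holds it.
        if self.locked_by is None:
            self.locked_by = owner
            return True
        return False

    def unlock(self, owner):
        # Only the current holder may release the lock.
        if self.locked_by == owner:
            self.locked_by = None


def commit_everywhere(repos, revision, owner, retries=5, max_wait=0.01):
    """Apply revision to every repository, or to none of them."""
    for _ in range(retries):
        held = []
        # Always lock in the same order so two committers cannot deadlock.
        for repo in sorted(repos, key=lambda r: r.name):
            if not repo.try_lock(owner):
                break
            held.append(repo)
        try:
            if len(held) == len(repos):
                # Every lock is held, so the commit can go to all of them.
                for repo in held:
                    repo.revisions.append(revision)
                return True
        finally:
            for repo in held:
                repo.unlock(owner)
        # Contention: wait a random amount of time and try again.
        time.sleep(random.uniform(0, max_wait))
    return False
```

For example, commit_everywhere([a, b], "rev1", "alice") returns True and
records the revision in both repositories, while the same call made when
someone else holds the lock on b returns False and leaves both untouched.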

There are some features that are needed just for the distributed read
support (master-slave) which are also needed for distributed read/write.
This use case is perhaps more relevant to open source projects and often
good enough even in the corporate environment, the reason being that
checkouts are usually much bigger than commits and therefore need to be much
faster. Repositories need to be chained together for this to work. Suppose
you have a master server and a read proxy which is bound to the master. In
the DVCS world, you either clone the proxy or check it out. Either way,
commit/push doesn't result in updating the master. After every commit, the
master then needs to update all of the proxies, which, to avoid a loop,
shouldn't result in an update back to the master. It would be nice if the
distribution to proxies were non-blocking as well. With distributed
repositories, it is possible that on occasion some or all of the servers
will become disconnected and so there needs to be some mechanism for
resyncing on reconnection. The options I see here are for the repositories
to update on reconnection, a periodic resync, or a check for coherency on
checkout/branch. I don't think the first option is compatible with the DVCS
philosophy and the second isn't great since there's an unnecessary period of
incoherency, though it results in the cost of distribution only being paid
on rare events (as opposed to every checkout/branch). There will also need
to be some access control mechanism to ensure that the read proxies remain
read-only except to updates from the master repository, which is only
possible I think right now through the use of a smart server (unintended
side effect?). For OS projects, it would be nice if there were a single
repository that would have a list of mirrors and the client would
automatically select the closest mirror.
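
As a rough illustration of the master-slave flow, the sketch below has the
master push one way to its read proxies and has each proxy check for
coherency on checkout. Master, ReadProxy and the online flag are
hypothetical names for illustration, not Bazaar APIs. For simplicity the
push is synchronous, whereas a real setup would want it to be non-blocking.

```python
class Master:
    """Hypothetical master repository that feeds read proxies."""

    def __init__(self):
        self.revisions = []
        self.proxies = []

    def commit(self, revision):
        self.revisions.append(revision)
        for proxy in self.proxies:
            # Updates flow one way only; proxies never push back,
            # so there is no update loop.
            if proxy.online:
                proxy.pull_from(self)


class ReadProxy:
    """Hypothetical read-only mirror bound to a single master."""

    def __init__(self, master):
        self.master = master
        self.revisions = []
        self.online = True
        master.proxies.append(self)

    def pull_from(self, master):
        # Copy only the revisions this proxy is missing.
        self.revisions.extend(master.revisions[len(self.revisions):])

    def checkout(self):
        # Coherency check on checkout: catch up first if the master
        # is reachable and this proxy has fallen behind.
        if self.online and len(self.revisions) < len(self.master.revisions):
            self.pull_from(self.master)
        return list(self.revisions)
```

A proxy that was disconnected during a commit keeps serving what it has,
and the next checkout after it reconnects brings it back in sync.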

Support for checkouts is great for scenarios where you don't expect to need
to commit to a local repository but there's still one feature missing:
lightweight branches. Cloning the whole repository takes way too long on a
large repository and consumes expensive disk space. Developers on these
large projects, perhaps even more so than OS projects, want support for
private branches and checkouts don't get you that. There should be a path to
convert from a lightweight checkout to a lightweight branch, the distinction
primarily being where the commits go. Git seems to have gone the route of
shallow clones, where some specified subset of the repository is cloned with
paralyzing restrictions. Mercurial seems to be heading down the same route
and I think it is a useful scenario to cover for many projects. However, as
you all have noticed I think, by far the most common scenarios are a full
clone (creating a new repository) and lightweight clones (only latest files
for development). Private branches would be great for sharing changes
between developers before integration and avoiding polluting the repository.

It is nice that hooks are Python plugins, which provide a lot of
flexibility, but that also means that customization for a project, like to
support replication, impacts all users of bzr. You can put it in your home
directory but then it can't be easily shared for all users of the
repository. You could add that repository to the plugin path but then the
setup for using a repository becomes difficult. I think the simplest
solution here is just to add the repository to the plugin path
automatically.
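
A sketch of what that could look like from the outside, assuming the
plugin search path is extended through the BZR_PLUGIN_PATH environment
variable. The .bzr/plugins location inside the repository and the
env_with_repo_plugins helper are assumptions for illustration, not
existing Bazaar behavior.

```python
import os


def env_with_repo_plugins(repo_root, environ=None,
                          subdir=os.path.join(".bzr", "plugins")):
    """Return a copy of the environment with the repository's plugin
    directory (a hypothetical location) first on BZR_PLUGIN_PATH."""
    env = dict(os.environ if environ is None else environ)
    repo_plugins = os.path.join(repo_root, subdir)
    existing = env.get("BZR_PLUGIN_PATH", "")
    # Keep the user's own entries, but avoid listing the repository twice.
    others = [p for p in existing.split(os.pathsep)
              if p and p != repo_plugins]
    env["BZR_PLUGIN_PATH"] = os.pathsep.join([repo_plugins] + others)
    return env
```

A wrapper script could pass the returned environment to bzr so that every
user of the repository picks up the same hooks without any setup.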

Last, the only tool that I've found that can robustly read a CVS repository
is cvs2svn. It has support for Bazaar now but I ran into some problems with
it. I was able to convert trunk with history fairly easily into Bazaar but
when I told it to include all the branches and tags (it doesn't even support
specifying specific ones, you get all or nothing) I killed the process
after >1 day of running and an 80GB fast import file, which didn't even
appear to be remotely near completion. I think a bzr2svn tool would also be
great as it would provide some comfort to risk-averse management, who could
always fall back on to something more established if something ever went
horribly wrong with Bazaar.

Thanks for your patience reading this email and I would appreciate your
thoughts on these suggestions.

Uri