Hi,<br>I evaluated Bazaar for use in a corporate environment and just wanted to provide you all with some (lengthy) feedback with suggestions on what would be needed for projects like ours. First a little bit about our environment.<br>
<br> * ~50 developers geographically distributed<br> * Using CVS, with one repository per site all kept in sync using a custom setup<br> * ~10GB repository (few binaries but 15 years of history)<br> * ~4GB for full checkout, ~500MB for just code needed for regular development<br>
 * Need to delete files with history on occasion (obliterate)<br> * Need to colocate all files in one repository so that they can be versioned together (which test runs with which version of software)<br>
* Sandboxes always checked out to NFS servers for easy sharing, backup/snapshots, and performance under load. Implies that disk space is expensive.<br> * Sandboxes typically split up on multiple NFS servers to isolate unique backup/snapshot and load requirements<br>
<br>I think environments like this are common in the corporate world and unfortunately the only relatively modern open source tool that seems to work in such an environment today is Subversion. Perforce also seems to be popular in such setups. There are a few open source (OS) projects which also seem to have a comparably large repository. KDE is one example, which still mostly uses Subversion, I'm guessing due to the large repository. I'm posting to the Bazaar mailing list because, of all the modern tools, I think Bazaar is the closest to supporting an environment such as this.<br>
<br>A common theme among DVCS tools seems to be cloning the whole repository, which in cases like this would be a disaster. This architecture does not scale well at all to larger projects. Bazaar stands out here with lightweight checkouts but is still missing in other areas.<br>
<br> * No support for partial checkouts<br> * No support for read (or ideally read/write) proxies<br> * Checkouts don't behave like branches<br> * No project specific hook/plugins<br> * Poor support for CVS migration<br>
<br>Browsing through some Bazaar history, looks like I'm not the first one to mention the need for supporting partial checkouts. It looked like there were some discussions around whether developers should be testing with only a partial repository but as has been mentioned before and as I hinted above, there are times when not every file in the repository is needed for regular development but you still want the files to be colocated. Checking out the whole repository and hiding all the parts you didn't ask for won't cut it I think.<br>
<br>Subversion provides native support for setting up read proxies. CVS and Subversion provide read/write support through WANDisco, though it costs money, as does Mercurial through the autosync plugin. I've been told before that this is contrary to the philosophy of DVCS tools but I don't think so. Even users of DVCS tools eventually have a single integration branch. The need for setting up proxies is not to change that but simply to speed up the operations in a distributed environment. Bazaar already has pieces of support here that just need to be slightly extended.<br>
<br>For distributed read/write support (master-master), Bazaar needs something like a distributed commit transaction. Bind sort of fits this role but not quite yet. First, you want repositories to be able to bind to multiple other repositories, not just one. Second, not only do you want the commit to succeed in the parent repository before the child, but you want the commit to succeed in all the repositories or none of them. Third, you'll need some sort of deadlock/livelock avoidance mechanism. It could be locking, agreement by all parties on conflict resolution, or something as simple as the ALOHA protocol (wait a random amount of time and try again). I think locking is the safest way. The Mercurial autosync approach of sending an email on conflicts won't scale well. Last, you need fault tolerance. In any distributed scenario servers will go up and down and you want people to be able to continue to work. As it appears many of you use IRC, you should already be familiar with this problem, except in this case automatic resolution upon reconnection is more complicated. I think distributed read/write may be too difficult to support in the near term.<br>
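<br>To make the second and third points a bit more concrete, here is a rough Python sketch of an all-or-nothing commit that takes a lock on every bound repository and backs off for a random interval when it can't get them all. The `Replica` class and `commit_everywhere` function are names I made up for illustration; they are not part of bzrlib, and the sketch ignores fault tolerance (servers dropping out mid-commit) entirely.<br>

```python
import random
import time


class Replica:
    """Hypothetical stand-in for one bound repository (not a bzrlib class)."""

    def __init__(self, name):
        self.name = name
        self.locked = False
        self.revisions = []

    def try_lock(self):
        # Non-blocking lock attempt: fail immediately if someone else holds it.
        if self.locked:
            return False
        self.locked = True
        return True

    def unlock(self):
        self.locked = False


def commit_everywhere(replicas, revision, retries=5, max_wait=0.05,
                      sleep=time.sleep):
    """Apply `revision` to every replica, or to none of them.

    Returns True if the commit landed everywhere, False if the locks could
    not all be acquired within `retries` attempts.
    """
    for _ in range(retries):
        held = []
        try:
            # Lock in a fixed (name) order so two committers never wait on
            # each other in a cycle.
            for replica in sorted(replicas, key=lambda r: r.name):
                if not replica.try_lock():
                    break
                held.append(replica)
            if len(held) == len(replicas):
                # Every lock is held, so the commit can be applied everywhere.
                for replica in held:
                    replica.revisions.append(revision)
                return True
        finally:
            # Release only the locks this attempt acquired.
            for replica in held:
                replica.unlock()
        # ALOHA-style backoff: wait a random amount of time and try again.
        sleep(random.uniform(0, max_wait))
    return False
```

A real implementation would also need a way to roll back a commit that fails partway through on one server, which is exactly the part that makes this hard.<br>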
<br>There are some features that are needed just for the distributed read
support (master-slave) which are also needed for distributed
read/write. This use case is perhaps more relevant to open source
projects and often good enough even in the corporate environment, the
reason being that checkouts are usually much bigger than commits and
therefore need to be much faster. Repositories need to be chained
together for this to work. Suppose you have a master server and a read
proxy which is bound to the master. In the DVCS world, you either clone
the proxy or check it out. Either way, commit/push doesn't result in
updating the master. After every commit, the master then needs to
update all of the proxies, which, to avoid a loop, shouldn't result in
an update back to the master. It would be nice if the distribution to
proxies were non-blocking as well. With distributed repositories, it is
possible that on occasion some or all of the servers will become
disconnected and so there needs to be some mechanism for resyncing on
reconnection. The options I see here are for the repositories to update
on reconnection, a periodic resync, or a check for coherency on
checkout/branch. I don't think the first option is compatible with the
DVCS philosophy and the second isn't great since there's an unnecessary
period of incoherency, though it results in the cost of distribution
only being paid on rare events (as opposed to every checkout/branch).
There will also need to be some access control mechanism to ensure that
the read proxies remain read-only except to updates from the master
repository, which is only possible I think right now through the use of
a smart server (unintended side effect?). For OS projects, it would be
nice if there were a single repository that would have a list of
mirrors and the client would automatically select the closest mirror.<br><br>Support for checkouts is great for scenarios where you don't expect to need to commit to a local repository, but there's still one feature missing: lightweight branches. Cloning the whole repository takes way too long on a large repository and consumes expensive disk space. Developers on these large projects, perhaps even more so than on OS projects, want support for private branches, and checkouts don't get you that. There should be a path to convert from a lightweight checkout to a lightweight branch, the distinction primarily being where the commits go. Git seems to have gone the route of shallow clones, where some specified subset of the repository is cloned with paralyzing restrictions. Mercurial seems to be heading down the same route and I think it is a useful scenario to cover for many projects. However, as I think you all have noticed, by far the most common scenarios are a full clone (creating a new repository) and lightweight clones (only latest files for development). Private branches would be great for sharing changes between developers before integration and for avoiding polluting the repository.<br>
<br>It is nice that hooks are Python plugins, which provide a lot of flexibility, but that also means that customization for a project, like to support replication, impacts all users of bzr. You can put it in your home directory but then it can't be easily shared for all users of the repository. You could add that repository to the plugin path but then the setup for using a repository becomes difficult. I think the simplest solution here is just to add the repository to the plugin path automatically.<br>
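<br>As a rough illustration of what "automatically" could mean, here is a small Python sketch that builds an environment in which a directory inside the repository comes first on `BZR_PLUGIN_PATH`. The function name `env_with_repo_plugins` and the `plugins` directory name are my own assumptions, not an existing bzr convention; only the `BZR_PLUGIN_PATH` variable itself is something bzr already reads.<br>

```python
import os


def env_with_repo_plugins(repo_root, environ=None, plugin_dir="plugins"):
    """Return a copy of the environment with the repository's plugin
    directory placed first on BZR_PLUGIN_PATH.

    `plugin_dir` is a hypothetical per-repository directory name.
    """
    env = dict(os.environ if environ is None else environ)
    repo_plugins = os.path.join(repo_root, plugin_dir)
    existing = env.get("BZR_PLUGIN_PATH", "")
    # Keep whatever the user already configured, minus empty entries and
    # any earlier copy of the repository's own directory.
    others = [p for p in existing.split(os.pathsep)
              if p and p != repo_plugins]
    env["BZR_PLUGIN_PATH"] = os.pathsep.join([repo_plugins] + others)
    return env
```

A wrapper script could pass the result as the `env` argument of `subprocess.call` when launching bzr, though ideally bzr would do the equivalent itself when it opens a repository.<br>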
<br>Last, the only tool that I've found that can robustly read a CVS repository is cvs2svn. It has support for Bazaar now but I ran into some problems with it. I was able to convert trunk with history fairly easily into Bazaar, but when I told it to include all the branches and tags (it doesn't even support specification of specific ones, you get all or nothing) I killed the process after >1 day of running and an 80GB fast-import file, which didn't appear to be anywhere near completion. I think a bzr2svn tool would also be great, as it would provide some comfort to risk-averse management, who could always fall back on something more established if something ever went horribly wrong with Bazaar.<br>
<br>Thanks for your patience reading this email and I would appreciate your thoughts on these suggestions.<br><br>Uri<br>