Feedback from evaluation in a corporate environment

John Arbash Meinel john at arbash-meinel.com
Thu Jan 7 17:06:50 GMT 2010




...

>   * Need to delete files with history on occasion (obliterate)

obliterate is tricky in a distributed environment. Just because you have
gotten rid of it in one place, doesn't mean it is gone everywhere else.
BitKeeper had the concept of an obliterate, which would also propagate.
(If it saw that this file was marked gone, it would propagate that
mark.) Having talked to MySQL, they found it to be an awful feature that
caused huge problems. (Invariably someone would mark something gone that
shouldn't be, and you had to somehow stop the spread before it got
merged into the production repo.)

Bazaar does have some support for "ghosting" (something is referenced
but not present), but if it ever encounters the actual content, it will
start holding on to it.


>   * Need to colocate all files in one repository so that they can be
> versioned together (which test runs with which version of software)

You don't need to colocate to get this effect. There are projects like
"scmproj" and "bzr-externals" that can get you the same level of
snapshot support, without having to have a single 4GB checkout of
everything. (Nested trees are a planned feature that would work similarly.)

With DVCS, you can reference a tree of files by a single identifier. So
saying that Tree X goes with Tree Y is a small amount of information to
track. (Versus SVN and CVS where you actually have to track the revision
of every file present.)

I'll go even further, and say that SVN doesn't really support what you
think it does. Specifically, suppose I have a checkout of projects A & B
at revision 100. I do my changes, run my test suite, and commit. By that
point there have been 20 changes to project B, but none to project A
(which I'm working on). My commit succeeds just fine as revision 121.

However, if someone goes to SVN and does "svn co -r 121", they will not
get a checkout of project A & B that work together. Even though it
worked when I committed, SVN did not require that I was up to date with
project B when I committed to project A. (Note that this is actually at
the *file* level, so changes to file foo in A are not synchronized with
changes to file bar in A.)
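To make the failure mode concrete, here is a toy model of that scenario
(plain Python, nothing to do with SVN's actual implementation): each
path advances independently, so "checkout -r N" hands you a combination
of file states that nobody ever tested together.

```python
# Hypothetical sketch of the mixed-revision problem described above.
# The history is a global log of (revision, path, content) triples.

def svn_checkout(history, rev):
    """Return the latest content of each path as of global revision rev."""
    state = {}
    for r, path, content in history:
        if r <= rev:
            state[path] = content
    return state

history = [
    (100, "A/foo", "foo-v1"),
    (100, "B/bar", "bar-v1"),
]
# 20 commits land on project B while I work against B at r100...
history += [(100 + i, "B/bar", "bar-v%d" % (i + 1)) for i in range(1, 21)]
# ...then my project-A commit succeeds as r121, tested against bar-v1.
history.append((121, "A/foo", "foo-v2"))

tested = {"A/foo": "foo-v2", "B/bar": "bar-v1"}
checkout = svn_checkout(history, 121)
# The r121 checkout contains my change plus 20 B changes I never tested.
assert checkout["A/foo"] == "foo-v2"
assert checkout["B/bar"] != tested["B/bar"]
```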

DVCS such as Bazaar actually provide *better* guarantees within a
project, and you can use that to provide better guarantees across
projects. (stuff like scmproj allows you to create a snapshot listing
the state of all the trees you are currently using.)


>   * Sandboxes always checked out to NFS servers for easy sharing,
> backup/snapshots, and performance under load. Implies that disk space is
> expensive.
>   * Sandboxes typically split up on multiple NFS servers to isolate
> unique backup/snapshot and load requirements
> 

^- I don't fully understand your requirements here. What is being done
in the sandbox, what data is being fetched from and pushed into the
sandbox, etc.?

...

> A common theme among DVCS tools seems to be cloning the whole
> repository, which in cases like this would be a disaster. This
> architecture does not scale well at all to larger projects. Bazaar
> stands out here with lightweight checkouts but is still missing in other
> areas.
> 
>   * No support for partial checkouts

Generally not, partially because it violates the "here is a tree at
state X" guarantee when part of the state is not present. Committing
could be done to generate state Y, just assuming that all missing files
are at their original state. However, merging is quite poorly defined if
some of the changes being merged touch files that are not present in
the subset.

A NestedTree design where you integrate separate trees actually provides
a better result for a variety of reasons. scmproj approximates that,
though unfortunately finishing the implementation of that design hasn't
been a primary goal for us.


>   * No support for read (or ideally read/write) proxies

A read proxy is pretty much just another repository. I suppose you could
be asking for more of a read-through proxy, such that you always connect
to the master to see what is available, then fetch what you can from
location ABC, and everything remaining from the master. (Potentially
telling the ABC locations that they should be updated, or fetching the
data "through" them.)

I think this mostly breaks down into not-really-needed-yet. It would be
a fairly small amount of code to create a caching repository, such that
if a client asked for the latest X, it would check another repo, fetch
anything missing, and then respond directly.
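As a rough sketch of that idea (plain Python dicts standing in for
repositories; this is not the real bzrlib API), a caching repository is
just a local store that fetches through on a miss:

```python
# Sketch of the read-through caching repository described above: serve
# what is cached locally, fetch only the missing revisions from the
# master, and remember them for the next reader.

class CachingRepository:
    def __init__(self, master):
        self.master = master          # dict: revision_id -> revision data
        self.cache = {}               # local copy, filled on demand
        self.fetches = 0              # how many reads hit the master

    def get_revision(self, revision_id):
        if revision_id not in self.cache:
            # Cache miss: fetch through from the master repository.
            self.cache[revision_id] = self.master[revision_id]
            self.fetches += 1
        return self.cache[revision_id]

master = {"rev-1": "data-1", "rev-2": "data-2"}
proxy = CachingRepository(master)
proxy.get_revision("rev-1")
proxy.get_revision("rev-1")       # second read served from the cache
assert proxy.fetches == 1
```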

Write proxies are generally... just another repo. If you are at the
point of using proxies, I would guess you don't have to be 100% in sync.
Heck, given that CVS commits aren't atomic, doing a checkout is never
100% guaranteed to be in sync. If the checkout takes long enough, you
can certainly have half the tree in state X and half in state Y.

>   * Checkouts don't behave like branches

>   * No project specific hook/plugins

It is fairly easy to provide a plugin which then provides project
specific plugins. The main reason we don't, is because in a distributed
system, you have to be wary of untrusted sources.

If I merge from $JOE_RANDOM, I don't want to be running his untrusted
plugin code.

In a corporate environment, you are much less likely to merge from
$JOE_RANDOM.
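The shape of such a "plugin that loads project plugins" could be as
simple as the following sketch (names here are purely illustrative, not
the real bzrlib plugin API): per-branch plugin code is only activated
when the branch location is explicitly trusted.

```python
# Hypothetical loader for project-specific plugins: code shipped inside
# a branch only runs when that branch's location is whitelisted, so
# merging from an untrusted source never executes its plugin code.

TRUSTED_LOCATIONS = {"/srv/bzr/prod"}   # illustrative trust list

def load_project_plugins(branch_location, available_plugins):
    """Return the plugins that are safe to activate for this branch."""
    if branch_location not in TRUSTED_LOCATIONS:
        return []                       # untrusted source: run nothing
    return list(available_plugins)

assert load_project_plugins("/srv/bzr/prod", ["replicate"]) == ["replicate"]
assert load_project_plugins("/home/joe_random/branch", ["evil"]) == []
```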

>   * Poor support for CVS migration

CVS is a very hard target to migrate away from, because of a basic lack
of all sorts of features (atomic commits, merge information, etc.).
Generally history has to be inferred, which makes it lossy-at-best.


...

> Subversion provides native support for setting up read proxies. CVS and
> Subversion provide read/write support through WANDisco, though it costs
> money, as does Mercurial through the autosync plugin. I've been told
> before that this is contrary to the philosophy of DVCS tools but I don't
> think so. Even users of DVCS tools eventually have a single integration
> branch. The need for setting up proxies is not to change that but simply
> to speed up the operations in a distributed environment. Bazaar already
> has pieces of support here that just need to be slightly extended.

SVN and CVS need custom support for this, because they don't have the
concept of distributed repositories. I don't specifically know what
you've found with Mercurial, but certainly some form of replication
could be pretty easily built onto any DVCS.

> 
> For distributed read/write support (master-master), Bazaar needs
> something like a distributed commit transaction. Bind sort of fits this
> role but not quite yet. First, you want for repositories to be able to
> be bound to multiple other repositories, not just one. Second, not only
> do you want the commit to succeed in the parent repository before the
> child but you want the commit to succeed in all the repositories or none
> of them. Third, you'll need some sort of deadlock/livelock avoidance
> mechanism. It could be locking, agreement by all parties on conflict
> resolution, or something as simple as the ALOHA protocol (wait a random
> amount of time and try again). I think locking is the safest way. The
> Mercurial autosync approach of sending an email on conflicts won't scale
> well. Last, you need fault tolerance. In any distributed scenario
> servers will go up and down and you want people to be able to continue
> to work. As it appears many of you use IRC, you should already be
> familiar with this problem except in this case automatic resolution upon
> reconnection is more complicated. I think distributed read/write may be
> too difficult to support in the near term.

Alternatively, if I have 2 branches, I instantly have master-master
support. When I want to sync, I merge & commit. You could automate that
if you wanted. Honestly, I think you're thinking about the problem space
as "how do I do exactly what I do today in a new VCS", rather than "how
could I version my software with a DVCS". A DVCS really opens up all
sorts of different workflows that supersede the status quo. In doing
so, it sometimes makes it harder to do the things you are used to doing.
(How does one do master-master replication if the whole tree must be
up-to-date before committing? It isn't possible to commit just part of
the tree anymore. Oh wait, you can just commit to a separate branch, and
merge when you actually do need synchronization...)
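A toy model of that point (assuming nothing about bzr's internals): two
branches each commit independently with no coordination, and a merge
revision with two parents reconciles them whenever synchronization is
actually wanted.

```python
# Two independent branches are already "master-master": each accepts
# commits on its own, and a merge commit recording both heads as
# parents performs the synchronization on demand.

class Branch:
    def __init__(self):
        self.revisions = {}           # rev_id -> (parents, message)
        self.head = None

    def commit(self, rev_id, message, extra_parent=None):
        parents = [p for p in (self.head, extra_parent) if p]
        self.revisions[rev_id] = (parents, message)
        self.head = rev_id

    def merge(self, other):
        # Pull the other branch's history, then commit a merge revision.
        self.revisions.update(other.revisions)
        self.commit("merge-" + other.head, "sync", extra_parent=other.head)

a, b = Branch(), Branch()
a.commit("a1", "change on master A")
b.commit("b1", "change on master B")   # no coordination needed
a.merge(b)                             # sync only when we choose to
parents, _ = a.revisions[a.head]
assert set(parents) == {"a1", "b1"}
```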

...
> The options I see here are
> for the repositories to update on reconnection, a periodic resync, or a
> check for coherency on checkout/branch. I don't think the first option
> is compatible with the DVCS philosophy and the second isn't great since
> there's an unnecessary period of incoherency, though it results in the
> cost of distribution only being paid on rare events (as opposed to every
> checkout/branch). 

DVCS are generally *always* coherent. It is just an issue of whether you
have the latest revision of the tree or not.


> Support for checkouts is great for scenarios where you don't expect to
> need to commit to a local repository but there's still one feature
> missing: lightweight branches. Cloning the whole repository takes way
> too long on a large repository and consumes expensive disk space.
> Developers on these large projects, perhaps even more so than OS
> projects, want support for private branches and checkouts don't get you
> that. There should be a path to convert from a lightweight checkout to a
> lightweight branch, the distinction primarily being where the commits
> go. GIT seems to have gone the route of shallow clones, where some
> specified subset of the repository is cloned with paralyzing
> restrictions. Mercurial seems to be heading down the same route and I
> think it is a useful scenario to cover for many projects. However, as
> you all have noticed I think, by far the most common scenarios are a
> full clone (creating a new repository) and lightweight clones (only
> latest files for development). Private branches would be great for
> sharing changes between developers before integration and avoiding
> polluting the repository.

Bazaar does have the concept of Stacked Branches, which we hope to
evolve into Shallow Branches. Stacked means that some data is available
here, and other data is available at a secondary location. Shallow would
be a stacked branch with a pre-defined amount of data locally. (such as,
just enough to recreate the whole working tree, data back for X
revisions, etc.)
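As a rough illustration (plain Python dicts, not the real bzrlib
storage API), a stacked lookup is just local-first with a fallback to
the stacked-on location:

```python
# Illustrative model of a stacked branch: a pre-defined amount of
# recent data lives locally, and anything older is read from the
# stacked-on repository.

def get_revision(local, stacked_on, rev_id):
    """Look in the local store first; fall back to the stacked-on repo."""
    if rev_id in local:
        return local[rev_id]
    return stacked_on[rev_id]

stacked_on = {"rev-%d" % i: "data-%d" % i for i in range(1, 101)}
local = {"rev-100": "data-100"}   # shallow: just enough for the tree

assert get_revision(local, stacked_on, "rev-100") == "data-100"  # local
assert get_revision(local, stacked_on, "rev-5") == "data-5"      # fallback
```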

However, people generally prefer not to version a 4GB repository in
lock-step. As such, things usually get broken down into manageable
chunks, which means that you don't have to download your entire
corporate history. And so the actual data requirements tend to be
reasonable. The entire history of bzr fits in approximately the same
amount of space as the checkout itself, which actually holds true for a
surprising range of projects, at least to within an order of magnitude.
So if you are grabbing 4GB of tree content, grabbing 8GB of history
content isn't terrible.


> 
> It is nice that hooks are Python plugins, which provide a lot of
> flexibility, but that also means that customization for a project, like
> to support replication, impacts all users of bzr. You can put it in your
> home directory but then it can't be easily shared for all users of the
> repository. You could add that repository to the plugin path but then
> the setup for using a repository becomes difficult. I think the simplest
> solution here is just to add the repository to the plugin path
> automatically.
> 
> Last, the only tool that I've found that can robustly read a CVS
> repository is cvs2svn. It has support for Bazaar now but I ran into some
> problems with it. I was able to convert trunk with history fairly easily
> into Bazaar but when I told it to include all the branches and tags (it
> doesn't even support specification or specific ones, you get all or
> nothing) I killed the process after >1 day of running and an 80GB fast
> import file, which didn't even appear to be remotely near completion. I
> think a bzr2svn tool would also be great as it would provide some
> comfort to risk averse management, who could always fall back on to
> something more established if something ever went horribly wrong with
> Bazaar.

Last I checked, cvs2svn was fully capable of handling a subset of
branches to be converted. I assume you mean a "cvs2bzr" command, which
actually already exists in the cvs2svn suite of tools.

I was converting a company over from CVS about 2 months ago, and I'm
pretty sure we converted a subset of the branches, and used "cvs2bzr" as
part of the conversion.

As for an 80GB intermediate... Yes, the intermediate file gets pretty
huge, as it generally extracts all texts without any deltas. I don't
know your repository details, but I've certainly seen things go to 10:1
the size of the source, and wouldn't be surprised to hit 100:1.

> 
> Thanks for your patience reading this email and I would appreciate your
> thoughts on these suggestions.
> 
> Uri

I think you raised some interesting points. I do think there are bits
you mention here that will be implemented with time. I also think that
some of the tradeoffs when switching to a DVCS are not immediately
apparent to you, such as tree-wide consistency. This has benefits and
downsides, and thus has implications for how one would want to use the
tool.

I certainly think that Bazaar would be capable of providing a system
which would work wonderfully in your environment. But it might not look
a lot like what you already have.

John
=:->
