Feedback from evaluation in a corporate environment

Uri Moszkowicz uri at 4refs.com
Thu Jan 7 23:03:41 GMT 2010


Hi John,
Thanks for the detailed replies. Some more comments below.

> obliterate is tricky in a distributed environment. Just because you have
> gotten rid of it in one place, doesn't mean it is gone everywhere else.
> BitKeeper had the concept of an obliterate, which would also propagate.
> (If it saw that this file was marked gone, it would propagate that
> mark.) Having talked to MySQL, they found it to be an awful feature that
> caused huge problems. (Invariably someone would mark something gone that
> shouldn't be, and you had to somehow stop the spread before it got
> merged into the production repo.)
>
> Bazaar does have some support for "ghosting" (something is referenced
> but not present), but if it ever encounters the actual content, it will
> start holding on to it.
>

I don't think there's a way to obviate the need for obliterate, though I
agree that the implementation is problematic. Corporate environments don't
have the trust issues that MySQL ran into, and I think Bazaar can deal with
the different environments through configuration options, perhaps even
providing separate distributions to minimize out-of-the-box configuration.

In the corporate environment we still have to worry about mistakes, but
environments like these usually have independent ways of dealing with
accidental obliteration. We would likely just restore the repository from a
recent snapshot or, at worst, a backup. For projects not hosted on a NetApp,
you could emulate that feature by moving the obliterated file to a recycle
bin for a period of time, purging it no sooner than some threshold (one
week?) at the next repack.
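
As a sketch of that recycle-bin scheme (hypothetical Python helpers, not
anything Bazaar provides today; the trash path and the one-week threshold
are my assumptions):

    import os, shutil, time

    TRASH = ".bzr/obliterated"         # hypothetical trash location
    THRESHOLD = 7 * 24 * 3600          # purge no sooner than one week

    def obliterate(path):
        """Move a file into the trash instead of deleting it outright."""
        os.makedirs(TRASH, exist_ok=True)
        # Encode the obliteration time into the trashed file's name.
        dest = os.path.join(TRASH,
                            "%d-%s" % (time.time(), os.path.basename(path)))
        shutil.move(path, dest)

    def purge_at_repack():
        """At the next repack, drop only the entries past the threshold."""
        if not os.path.isdir(TRASH):
            return
        now = time.time()
        for name in os.listdir(TRASH):
            stamp = int(name.split("-", 1)[0])
            if now - stamp >= THRESHOLD:
                os.remove(os.path.join(TRASH, name))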


> You don't need to colocate to get this effect. There are projects like
> "scmproj" and "bzr-externals" that can get you the same level of
> snapshot support, without having to have a single 4GB checkout of
> everything. (Nested trees are a planned feature that would work similarly.)
>

> With DVCS, you can reference a tree of files by a single identifier. So
> saying that Tree X goes with Tree Y is a small amount of information to
> track. (Versus SVN and CVS where you actually have to track the revision
> of every file present.)
>

> I'll go even further, and say that SVN doesn't really support what you
> think it does. Specifically, say I have a checkout of projects A & B at
> revision 100. I do my changes, run my test suite, and commit. At that
> point there have been 20 changes to project B, but none to project A
> (which I'm working on). My commit succeeds just fine as revision 121.
>
> However, if someone goes to SVN and does "svn co -r 121", they will not
> get a checkout of project A & B that work together. Even though it
> worked when I committed, SVN did not require that I was up to date with
> project B when I committed to project A. (Note that this is actually at
> the *file* level, so changes to file foo in A are not synchronized with
> changes to file bar in A.)
>
> DVCS such as Bazaar actually provide *better* guarantees within a
> project, and you can use that to provide better guarantees across
> projects. (stuff like scmproj allows you to create a snapshot listing
> the state of all the trees you are currently using.)
>

You have to be careful with reliance on non-packaged plugins: it is already
difficult for someone evaluating a tool to determine whether some
configuration of the software would suit their needs, without at the same
time having to evaluate the compatibility and stability of the plugins.

In any case, I think I may have confused you. I was only suggesting that
breaking up the repository is a possible solution to the problem, not a
desirable one. The idea was to break apart the repository so that you don't
necessarily have to check out all the files. I think externals make sense
primarily when you will be using the work of an independent team or project
(libraries, for example). Unfortunately, our repository is big in its own
right, so the only way to split it up would be an unnatural one.

Our repository is so large that even lightweight checkouts are problematic.
Our CVS repository, like many others I'm sure, has two modules: software and
tests. In a bzr repository with just software, a lightweight checkout takes
~1 min and 1GB - not a problem at all (with CVS it's more like 10 min). In a
bzr repository with software and tests, a lightweight checkout takes 26 min
and 7GB. So quite a difference, both in terms of time and disk space,
keeping in mind that disk space on a NetApp is much more expensive than on a
commodity desktop. We typically need the whole software module to build and
test software but don't usually run the full test suite (because it requires
multiple days on a large farm), so only a small fraction of those files is
typically needed for development.

The need for partial checkouts becomes even more apparent when not doing
lightweight checkouts. Lightweight checkouts aren't just more familiar to
CVS users, they are also much faster: the heavyweight checkout of the bzr
repository with just software took 3.25 min, or ~3.5x longer, and that
doesn't even include any branches! (I can't run the experiment with branches
due to the problem described later.) Based on the above numbers, I'd
estimate that a branch of software and tests, again without any branches,
would take about 1.5 hours. I think you'd agree that these times are too
long to be productive with, and these are virtually local operations on
high-end (granted, NFS-based) hardware, so they won't get much faster.


>
> >   * Sandboxes always checked out to NFS servers for easy sharing,
> > backup/snapshots, and performance under load. Implies that disk space is
> > expensive.
> >   * Sandboxes typically split up on multiple NFS servers to isolate
> > unique backup/snapshot and load requirements
> >
>
> ^- I don't fully understand your requirements here. What is being done
> in the sandbox, what data is being fetched from and pushed into the
> sandbox, etc.
>

What I meant was that the repository, like most, contains software and
tests. Software, being written by humans, needs a lot of snapshots and
regular backups. Tests can generate a lot of files and I/O, so you don't
want as many snapshots, but you might want a few because tests can take a
long time to run and you want to avoid accidentally deleting files. There is
no need to back up files generated by tests either. Object files are easy to
regenerate, so you don't keep snapshots or backups of them at all.

The implication here for the SCM tool is that the files from the repository
need to go into two separate paths: one for software and one for tests. An
alternative would be to check out the whole repository in both places, but
again that would be unaffordably wasteful.


> Generally not. Partially because it violates the "here is a tree at
> state X" when part of the state is not present. Committing could be done
> to generate state Y, just assuming that all missing files are at their
> original state. However merging is quite poorly defined if you have some
> changes being merged which are not present in the subset.
>
> A NestedTree design where you integrate separate trees actually provides
> a better result for a variety of reasons. scmproj approximates that, though
> unfortunately it hasn't been a primary goal for us to finish
> implementing the design.
>

I don't think you need to change your philosophy about the atomic view of a
repository, just that you shouldn't need to view the whole thing at once.
What's the problem with merging? If the files aren't checked out, then
there's no need to do any merging. Why does Bazaar even need to concern
itself with files that haven't been checked out? Remember, the SCM tool only
provides the ability to describe versions of content, and you can certainly
describe how only a portion of it needs to change.
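
To illustrate what I mean (a toy model in Python, not Bazaar's internals):
a commit made from a partial view just overlays the changed files onto the
parent tree state, so every file that was never checked out is carried
forward at its original state and the whole-tree view stays atomic.

    def commit(parent_state, changes):
        """Build a child tree state from a partial working set.

        'changes' covers only the files that were actually checked out
        and modified; everything absent is assumed to still be at its
        parent state.
        """
        child = dict(parent_state)   # start from the full parent tree
        child.update(changes)        # overlay just the changed files
        return child

    # tests/ was never checked out, yet the child tree is still complete.
    parent = {"software/main.c": "v1", "tests/big.dat": "v1"}
    child = commit(parent, {"software/main.c": "v2"})
    assert child == {"software/main.c": "v2", "tests/big.dat": "v1"}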

The problem with a NestedTree design is the need to determine the tree
structure a priori. Once you have it implemented, you could pessimistically
consider every directory to be a nested tree and then you've got partial
checkouts. I'm guessing that this would hurt the efficiency of the database,
in addition to being inconvenient if it were exposed in the interface, so I
don't consider it an effective replacement.

> A read proxy is pretty much just another repository. I suppose you could
> be asking for more of a read-through proxy. Such that you always connect
> to the master to see what is available, and then fetch what you can from
> location ABC, and everything remaining from the master. (Potentially
> telling the ABC locations that they should be updated, or fetching the
> data "through" them.)
>

That is roughly correct, though you've restricted the implementation to one
where updates are passed to the cache on the branch/checkout operation
rather than prefetched, which would lower the average response time for
users. To maintain coherency with the master server and get maximum
performance you really need both, though, since at times the master may be
disconnected from the read-through proxy. I agree that this isn't too hard
to implement, and I was considering doing it until I discovered the lack of
support for partial checkouts.
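
Roughly, what I have in mind looks like this (a sketch with made-up names,
not Bazaar's real repository API): every request consults the master for
what exists, serves what the local cache already holds, and pulls the rest
through; a prefetcher run from a change hook or cron job keeps the cache
warm ahead of demand.

    class ReadThroughProxy:
        """Sketch of a read-through cache between clients and a master."""

        def __init__(self, cache, master):
            self.cache = cache    # revision id -> data, local to this site
            self.master = master  # revision id -> data, the remote master

        def fetch(self, rev_id):
            # Always ask the master what is current, but serve the bytes
            # locally whenever we already have them.
            if rev_id not in self.master:
                raise KeyError("unknown revision: %s" % rev_id)
            if rev_id not in self.cache:
                self.cache[rev_id] = self.master[rev_id]  # fetch through
            return self.cache[rev_id]

        def prefetch(self, rev_ids):
            # Lowers latency for the *first* reader at this site,
            # not just the ones who come after.
            for rev_id in rev_ids:
                self.fetch(rev_id)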


> I think this mostly breaks down into not-really-needed-yet. It would be
> a fairly small amount of code to create a caching repository. Such that
> if a client asked for the latest X, it would check another repo, fetch
> anything missing, and then respond directly.
>

It is only needed for global projects with large repositories. Looking at
the WANDisco customer list, there are certainly customers who need it and
could use it now. I'm sure Subversion gets some amount of adoption because
of this support as well. Bazaar is your project, so I can't tell you how to
prioritize your work, only that projects like the one I work on need it; I
can then only hope the feature gets added near the top of the list :) I
think it would also be useful to projects which have already latched on to
Bazaar, and it would provide another distinct advantage of Bazaar over
Mercurial and Git.


>
> Write proxies are generally... just another repo. If you are at the
> point of using proxies, I would guess you don't have to be 100% in sync.
> Heck, given that CVS commits aren't atomic, doing a checkout is never
> 100% guaranteed to be in sync. If the checkout takes long enough, you
> can certainly have half the tree in state X and half in state Y.
>

Yes, it is like another repo, but not exactly - that's why I say Bazaar is
close but not quite there. The repositories don't have to be 100% in sync at
all times; they just have to appear to be that way at the times that
operations are done. The repositories have to be coherent, just like caches
have to be coherent in a multi-core CPU.

Caches have it easy, though, because cache line updates can't fail. With
write proxies, you can run into deadlocks and livelocks that result in
repositories that can never be in sync. Imagine that two users
simultaneously commit changes to their local repositories. Both make their
commits locally and then try to update each other in a way that results in a
conflict. When the conflict is detected, both need to undo their commit in
every repository where it has already gone through, otherwise the
repositories will remain out of sync with different contents. Suppose you
solve this problem; then you have to worry about those two users repeatedly
retrying the merge, resulting in a livelock. And we haven't even gotten into
what happens if the network becomes disconnected and then reconnects later.
There are solutions to these problems, but the point I'm trying to make is
that this is more than just another repository. I think these problems are
hard enough that, at least initially, you should stick to read proxies.
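
To make the race concrete (plain Python, nothing Bazaar-specific): both
replicas commit locally, then each tries to push to the other; neither side
is strictly ahead, so both pushes are rejected, and the merge-and-retry loop
that follows is exactly where the livelock risk lives.

    class Replica:
        def __init__(self):
            self.history = []          # commit ids, oldest first

        def commit(self, rev_id):
            self.history.append(rev_id)

        def push(self, other):
            # A push succeeds only if the receiver's history is a prefix
            # of ours, i.e. we are strictly ahead of it.
            if other.history == self.history[:len(other.history)]:
                other.history = list(self.history)
                return True
            return False

    a, b = Replica(), Replica()
    a.commit("base"); b.commit("base")   # both replicas start in sync
    a.commit("x1")                       # user 1 commits at site A...
    b.commit("y1")                       # ...while user 2 commits at site B
    print(a.push(b), b.push(a))          # False False: diverged, both must merge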

By the way, we have added locking to our CVS repositories to avoid the
problem you described.


> It is fairly easy to provide a plugin which then provides project
> specific plugins. The main reason we don't, is because in a distributed
> system, you have to be wary of untrusted sources.
>
> If I merge from $JOE_RANDOM, I don't want to be running his untrusted
> plugin code.
>
> In a corporate environment, you are much less likely to merge from
> $JOE_RANDOM.
>

Yes, as you state, trust isn't a problem in the corporate environment, but
supporting multiple users is, so we have the exact opposite problem that
open-source users have. The same Bazaar installation will be used for
multiple projects and to fetch projects off the web. They all have to share
the same plugins, even though some of the plugins will be specific to only
certain projects. There's work required to make sure that the customizations
for one project don't interfere with those for another.

I understand the security concern, but if you're downloading software from
$JOE_RANDOM, aren't you already exposing yourself to a security risk through
the content itself? Also, it's not like you've necessarily eliminated the
plugin security risk: to make use of the repository you may still have to
install those plugins. The security precaution in this case has only
resulted in creating more work for you before you can use that repository.
Just because the plugins are branch-specific doesn't mean that they have to
travel with the branch; it would just be nice that if you did put them in
there, they would be used. I do think it is useful to have the ability to
have plugins travel with the branch, however, and in that case maybe the
best solution is to give Bazaar the ability to note which sources are
trusted, the way SSH does.
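
That could be as simple as a known_hosts-style list (entirely hypothetical;
no such Bazaar feature or file exists): plugins carried by a branch only get
loaded if the branch's origin is on a trust list the user maintains.

    import os

    TRUST_FILE = os.path.expanduser("~/.bazaar/trusted_sources")  # hypothetical

    def is_trusted(branch_url):
        """True if the branch's origin appears in the user's trust list."""
        try:
            with open(TRUST_FILE) as f:
                trusted = [line.strip() for line in f if line.strip()]
        except IOError:
            return False
        # Match by prefix so one entry can cover a whole host or project.
        return any(branch_url.startswith(entry) for entry in trusted)

    def maybe_load_branch_plugins(branch_url, load_plugins):
        if is_trusted(branch_url):
            load_plugins(branch_url)   # load_plugins is a stand-in hook
        else:
            print("ignoring plugins from untrusted source: %s" % branch_url)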

> SVN and CVS need custom support for this, because they don't have the
> concept of distributed repositories. I don't specifically know what
> you've found with Mercurial, but certainly some form of replication
> could be pretty easily built onto any DVCS.
>

You should provide it out of the box to win more users :) What's easy for
you will be tough for most people (many won't know Python, won't know
whether it's even easy to do, etc.). I agree, though, that conceptually it's
easy to do with any DVCS.


> Alternatively, if I have 2 branches, I instantly have master-master
> support. When I want to sync, I merge & commit. You could automate that
> if you wanted. Honestly, I think you're thinking about the problem space
> as "how do I do exactly what I do today in a new VCS". Rather than "how
> could I version my software with a DVCS". It really opens up all sorts
> of different workflows, that really supersede the status-quo. In doing
> so, it sometimes makes it harder to do the things you are used to doing.
> (How does one do master-master replication if the whole tree must be
> up-to-date before committing? It isn't possible to commit just part of
> the tree anymore. Oh wait, you can just commit to a separate branch, and
> merge when you actually do need synchronization...)
>

As I mentioned earlier, automation is challenging, which is why a product
appeared to fill this need (WANDisco). What you are describing is not a
change in how we use the new VCS but rather in how we organize our teams and
collaborate. The trouble with separate repositories is that integration
doesn't happen as frequently, whereas a small team can get away with
integrating continuously rather than at periodic intervals. There's simply
an application for DVCS that you haven't considered: geographically
distributing a centralized repository. Furthermore, there are hybrid
workflows which are interesting, where the integration is centralized but
developers still want to have private branches and to share changes with
peers using those branches - something only a DVCS can provide.
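
For what it's worth, the naive automation is only a few lines of shelling
out to bzr (a sketch; the branch paths are made up, and the conflict case
this script punts on is precisely where the hard problems above live):

    import subprocess

    def sync(local_checkout, peer_branch):
        """Naive master-master sync: merge the peer, commit, push back.

        This is the easy part. The hard part, as argued above, is what
        to do when the merge conflicts or the peer is unreachable.
        """
        def bzr(*args):
            subprocess.check_call(["bzr"] + list(args), cwd=local_checkout)

        bzr("merge", peer_branch)                 # conflicts raise here
        bzr("commit", "-m", "automated sync merge")
        bzr("push", peer_branch)

    # Example (hypothetical locations):
    # sync("/srv/branches/site-a", "bzr+ssh://site-b/srv/branches/site-a")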

CVCS tools work well for small, cohesive teams. Such teams tend to avoid
outsourcing because they can no longer operate this way across a
geographical divide, and the cost can outweigh the gains. A DVCS used as
I've proposed solves this problem.


>
> ...
> > The options I see here are
> > for the repositories to update on reconnection, a periodic resync, or a
> > check for coherency on checkout/branch. I don't think the first option
> > is compatible with the DVCS philosophy and the second isn't great since
> > there's an unnecessary period of incoherency, though it results in the
> > cost of distribution only being paid on rare events (as opposed to every
> > checkout/branch).
>
> DVCS are generally *always* coherent. It is just an issue of whether you
> have the latest revision of the tree or not.
>

A DVCS is only coherent if it always updates to the latest revision before
performing a checkout/branch.


> Bazaar does have the concept of Stacked Branches, which we hope to
> evolve into Shallow Branches. Stacked means that some data is available
> here, and other data is available at a secondary location. Shallow would
> be a stacked branch with a pre-defined amount of data locally. (such as,
> just enough to recreate the whole working tree, data back for X
> revisions, etc.)
>
> However, people really don't prefer versioning a 4GB repository in
> lock-step. As such, things usually get broken down into manageable
> chunks, which means that you don't have to download your entire
> corporate history. And so the actual data requirements
> tend to be reasonable. The entire history of bzr fits in approximately
> the same amount of space as the size of the checkout, which actually
> holds true for a surprising range of projects. At least within an
> order-of-magnitude of each other. So if you are grabbing 4GB of tree
> content, grabbing 8GB of history content isn't terrible.
>

I think in the corporate environment you often don't have much of a choice
about how big your repository is. Even if we could agree that it would be
useful to break it down, corporations are full of bureaucracy that makes
doing so really tough. As I showed earlier, the ratio is more like 3.5x
rather than 2x, and likely much larger once you include all of the branch
history that I didn't import. The ratio also grows with time, which is a
scalability problem.


>
> I certainly think that Bazaar would be capable of providing a system
> which would work wonderfully in your environment. But it might not look
> a lot like what you already have.
>

I'm not so confident about that. I think we'd have to break up our
repository into smaller chunks, which means we'd have to rewrite our build
system. Then we'd have to devise a scheme for integrating the products from
the different groups, which means that problems won't be detected quickly.
Staff would likely have to be hired to perform those integrations. I was
attracted to Bazaar because it advertises all the workflows it supports as
an advantage over competing tools. My goal here was to describe a workflow
which, I'll admit, has some tradeoffs and is not appropriate for everyone,
which Bazaar does not currently support, but which others would likely be
interested in. I would describe it as "Decentralized with distributed shared
main line".