Replacing an expensive proprietary CM system with bzr.
Talden
talden at gmail.com
Tue Feb 26 21:48:00 GMT 2008
Is there a commentary anywhere in the docs on known scaling behaviour?
E.g. number of files/folders in a working tree, size of individual
files, total size of working tree, number of revisions, commit
sizes...
We have a CVS tree that, in an experimental conversion to Subversion,
produced around 25,000 revisions for a working tree with 17,000 files
in 3,500 folders. The working tree totals (not counting any VCS
book-keeping) 800MB with about 1000 files accounting for 75% of that
volume.
There are a couple of dozen branches but I would expect 2/3rds of the
revisions to be on HEAD.
I'd be interested in knowing how this might perform in Bazaar -
obviously I'm not expecting numbers, but are there any known scaling
issues?
--
Talden
On Wed, Feb 27, 2008 at 9:05 AM, Robert Collins
<robertc at robertcollins.net> wrote:
>
> On Tue, 2008-02-26 at 20:40 +0100, Jurgen Defurne wrote:
> > I am currently doing an investigation to see if it would be possible
> > (long term) to replace a commercial CM solution with a combination of
> > Bazaar and mySQL. I only need to focus on Bazaar, since a whole lot of
> > the bug tracking is currently supported by mySQL for caching and
> > speed.
>
> Cool
>
>
> > * Necessary features for switching from the other system to bzr
> > ** Graphical interface in Windows : context menu like TortoiseSVN
>
> There is a rough tortoiseBZR at the moment, and I believe it is being
> improved. Certainly we consider this an important feature.
>
>
> > ** Integration of bug tracker and VCS via graphical interface
>
> Bazaar supports metadata in commits; we have an option '--fixes' in the
> bzr command line client that will record a bug being fixed. I'm fairly
> sure this is also supported in the main GUI clients today, but if not it
> should be quite trivial to do so. ('bzr help bugs' will give you more
> details about this feature).
>
>
> > ** Good graphical history view of objects
>
> I think 'bzr viz' is quite good - certainly it has been reused by other
> modern VCS tools to provide their graphical history view.
>
> > ** NestedTreeSupport
>
> This is currently immature; the bulk of the development work has been
> completed, but the developers involved have prioritised performance
> improvements over nested trees in the short term. We have a group
> get-together next week and this is on the agenda. Nested trees are an
> important feature :).
>
>
> > ** How does Bazaar handle databases of 80 Gb and more ?
> > The main question here is, how can you improve on the speed of the
> > central repository when several developers at once are doing
> > updates or checkouts ? When the build manager is tagging 211
> > different checkouts in one tree ?
>
> Well, I can't speak for how it handles it today, as I don't have the
> facilities to realistically test on that scale. I can give you a few
> thoughts about where we are today, and what we are doing in the future.
> In two parts - working tree, and repository. For the repository: it is
> part of the specification for nested tree that you can have nested trees
> while still having separate history databases for each tree. This lets
> you partition the IO workload required by your servers. Each repository
> has a set of read only data files (packs), and commits create new files;
> and from time to time combine existing files to prevent huge seek
> activity when accessing your database. The combining operation gives
> increased locality of reference for related data and helps with scaling
> up. Our indices follow a similar scheme, with each index being keyed to
> a specific pack file. Currently we buffer the region of each index that
> is accessed during an operation in memory for performance. In
> extremely large databases this could lead to memory pressure - but we can
> reduce or eliminate that buffering. We already have plans to tweak the
> index format to reduce the need for buffering. Updates and checkouts
> only perform reads on the central repository, and the same disk blocks
> will be read for developers working on the same region of your project,
> allowing OS disk cache hits. There are developers working on better
> delta logic too, which will approximately halve the size of your
> historical database between the current bzr storage and the new one.
>
> For the working tree, the checkout on disk that you commit to, the
> primary indicator for performance is not the number of bytes in the
> tree, but rather the total number of paths, and the number of lines in
> any modified file. Tagging any number of subtrees - that could mean two
> things. It could mean a commit in the top level followed by making a new
> branch, or it could just mean making a new branch from a previously
> known point in time. For the former, bzr will first perform essentially
> 'bzr status' on each tree to detect changes, and then record a new
> inventory for the top level tree, which should be a fairly small tree
> from the sound of it. Making a new branch in bzr requires writing a few
> K of data when a shared repository is in use (and you'll likely want
> one :)), so should be nearly instant.
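A minimal sketch of the "snapshot by branching" idea described above, assuming a shared repository that already holds the mainline branch. It only builds the bzr command lines (`bzr init-repo` and `bzr branch`, optionally with `-r`) and does not run them. The directory and branch names (`repo`, `trunk`, `build-42`) and the helper function names are hypothetical.

```python
import posixpath


def init_shared_repo_command(repo_dir):
    """Command to create a shared repository (run once)."""
    return ["bzr", "init-repo", repo_dir]


def snapshot_branch_command(repo_dir, source_branch, snapshot_name, revision=None):
    """Command to record a snapshot as a new branch inside the shared repository.

    With revision=None the branch is made from the tip of source_branch;
    otherwise from the given, previously known revision.
    """
    cmd = ["bzr", "branch"]
    if revision is not None:
        cmd += ["-r", str(revision)]
    cmd += [
        posixpath.join(repo_dir, source_branch),
        posixpath.join(repo_dir, snapshot_name),
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical names, printed for inspection only.
    print(" ".join(init_shared_repo_command("repo")))
    print(" ".join(snapshot_branch_command("repo", "trunk", "build-42")))
    print(" ".join(snapshot_branch_command("repo", "trunk", "build-41", revision=1200)))
```

Because both branches live in the same shared repository, the new branch reuses the revisions already stored there, which is why the operation is expected to write only a small amount of data.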
>
>
>
> > ** Daily build and acceptance work-flow description
> >
> > Developers check out their private checkout, and use it for
> > development.
> >
> > At the end of the day a program starts an update and a build. In
> > the morning, the test team checks the build results and, later
> > in the day, gives the go-ahead that the particular revision is
> > all right. However, as part of the build,
> > there may be libraries and executables taken in, which need to be
> > committed. Does this mean that prior to daily acceptance no one may
> > commit, or is there another solution possible with bzr ?
>
> Branches are a very useful thing to represent concurrency. I would say
> here that your acceptance tool should have its own branch. When it runs
> it will:
> - pull --overwrite the mainline into its branch
> - test
> - commit the tree if changes were made
>
> Then, whoever is responsible for checking acceptance does:
> - merge from the acceptance tool's branch
> - commit
>
> And at this point developers will get the acceptance tool's changes and
> libraries when they next update from the mainline.
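The two lists of steps above can be sketched as command sequences. This is an illustration under stated assumptions, not a tested deployment: it only builds the bzr command lines (`bzr pull --overwrite`, `bzr commit -m`, `bzr merge`) without running them. The branch locations, the commit messages, and the `./run-build-and-tests` script name are all hypothetical placeholders.

```python
import shlex


def acceptance_run_commands(mainline, changed, message="Automated acceptance build"):
    """Commands the acceptance tool would run inside its own branch."""
    cmds = [["bzr", "pull", "--overwrite", mainline]]
    # The build/test step is site-specific; this script name is a placeholder.
    cmds.append(["./run-build-and-tests"])
    # Only commit when the build produced changes (e.g. new libraries).
    if changed:
        cmds.append(["bzr", "commit", "-m", message])
    return cmds


def promotion_commands(acceptance_branch, message="Merge accepted build"):
    """Commands run in a mainline checkout by whoever signs off the build."""
    return [
        ["bzr", "merge", acceptance_branch],
        ["bzr", "commit", "-m", message],
    ]


def render(cmds):
    """Render argument lists as shell-quoted strings for display."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in cmds]


if __name__ == "__main__":
    # Hypothetical locations, printed for inspection only.
    for line in render(acceptance_run_commands("../mainline", changed=True)):
        print(line)
    for line in render(promotion_commands("../acceptance")):
        print(line)
```

Keeping the acceptance tool on its own branch means developers can keep committing to the mainline while the build runs; the build's output only reaches them after the merge and commit in the second step.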
>
>
>
> > After daily acceptance, it should be possible to say to the
> > developers to which revision they may update their checkout.
> >
> > Developers are responsible for merging their work against the
> > latest accepted (promoted) status.
>
> > ** Multi-site work flow description
> > Daily check-ins are not sent to each repository. Rather, all
> > projects get assigned responsibility to one subsystem, and these
> > are developed in the context of other subsystems. There is a weekly
> > release. This weekly release should be taken into the other
> > repositories. The biggest problem with this part is the deletion of
> > objects.
> >
> > I suppose that in the case of daily-build and multi-site work flow,
> > the solutions lie in the distributed development model of
> > bzr. In this case, I should probably first do a comparison of terms.
> >
> > | Proprietary | bzr |
> > |-------------+-----------------------|
> > | database | repository |
> > | project | light-weight checkout |
> > | version | revision |
> >
> > The other system has an export mechanism which makes it possible to
> > pack a certain revision of a project into a package which can be
> > transferred to another database and recreated there. This mechanism
> > can be used differentially, in which case it is possible to send
> > only deltas between the originating and the receiving databases. I
> > suppose the only available mechanism in lieu of this is probably
> > diff and patch.
>
> Why not push and pull ?
>
>
> > I should be running checks to see which transport mechanism is the
> > fastest. I have already established that the file:// protocol over
> > a CIFS share is very slow.
>
> What latency do you have involved between your workstation and the CIFS
> share?
> How many files in the checkout?
> How many revisions?
> (bzr info -v will answer some of this).
>
> You may be running into scaling issues with your tree size - in which
> case I'd be delighted to help you track them down sufficiently that I can
> file a bug report (which can then be fixed).
>
>
> > ** The way access rights have to be assigned to a repository is not
> > clear
> > Especially since I am now doing experiments in a Cygwin environment,
> > I have twice had problems with locks which could not be removed from the
> > repository after doing a checkout.
>
> Cygwin is, in general, very slow. (I spent some time as a cygwin
> developer - this is not a criticism). I strongly recommend using the
> native python version of bzr as we have done the porting work in bzr,
> which means less overhead and better performance.
>
>
> --
> GPG key available at: <http://www.robertcollins.net/keys.txt>.
> >
>