Replacing an expensive proprietary CM system with bzr.

Tue Feb 26 21:48:00 GMT 2008

Is there a commentary anywhere in the docs on known scaling behaviour?

E.g. number of files/folders in a working tree, size of individual
files, total size of working tree, number of revisions, commit
sizes...

We have a CVS tree that, in an experimental conversion to Subversion,
produced around 25,000 revisions for a working tree with 17,000 files
in 3,500 folders.  The working tree totals (not counting any VCS
book-keeping) 800MB with about 1000 files accounting for 75% of that
volume.

There are a couple of dozen branches but I would expect 2/3rds of the
revisions to be on HEAD.

I'd be interested in knowing how this might perform in Bazaar -
obviously I'm not expecting numbers, but are there any known scaling
issues?

--
Talden

On Wed, Feb 27, 2008 at 9:05 AM, Robert Collins
<robertc at robertcollins.net> wrote:
>
>  On Tue, 2008-02-26 at 20:40 +0100, Jurgen Defurne wrote:
>  > I am currently doing an investigation to see if it would be possible
>  > (long term) to replace a commercial CM solution with a combination of
>  > Bazaar and mySQL. I only need to focus on Bazaar, since a whole lot of
>  > the bug tracking is currently supported by mySQL for caching and
>  > speed.
>
>  Cool
>
>
>  > * Necessary features for switching from the other system to bzr
>  > ** Graphical interface in Windows : context menu like TortoiseSVN
>
>  There is a rough tortoiseBZR at the moment, and I believe it is being
>  improved. Certainly we consider this an important feature.
>
>
>  > ** Integration of bug tracker and VCS via graphical interface
>
>  Bazaar supports metadata in commits; we have an option '--fixes' in the
>  bzr command line client that will record a bug being fixed. I'm fairly
>  sure this is also supported in the main GUI clients today, but if not it
>  should be quite trivial to do so. ('bzr help bugs' will give you more
>  details about this feature).
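
As a rough illustration of the '--fixes' option described above: a
minimal sketch in which the bug number and commit message are made up,
"lp" (Launchpad) is assumed to be the tracker in use, and commit_fix is
a hypothetical wrapper name. 'bzr help bugs' explains how other
trackers are named.

```shell
# BZR can be overridden; BZR="echo bzr" prints the command instead of running it.
BZR=${BZR:-bzr}

# Hypothetical helper: $1 = Launchpad bug number, $2 = commit message.
# Records in the commit metadata that this revision fixes the given bug.
commit_fix() {
    $BZR commit --fixes "lp:$1" -m "$2"
}

# Example (placeholder values), run inside a working tree:
#   commit_fix 12345 "Stop crashing on empty trees"
```
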
>
>
>  > ** Good graphical history view of objects
>
>  I think 'bzr viz' is quite good - certainly it has been reused by other
>  modern VCS tools to provide their graphical history view.
>
>  > ** NestedTreeSupport
>
>  This is currently immature; the bulk of the development work has been
>  completed, but the developers involved have prioritised performance
>  improvements over nested trees in the short term. We have a group
>  get-together next week and this is on the agenda. Nested trees are an
>  important feature :).
>
>
>  > ** How does Bazaar handle databases of 80 Gb and more ?
>  >    The main question here is, how can you improve on the speed of the
>  >    central repository when several developers at once are doing
>  >    updates or checkouts ? When the build manager is tagging 211
>  >    different checkouts in one tree ?
>
>  Well, I can't speak for how it handles it today, as I don't have the
>  facilities to realistically test on that scale. I can give you a few
>  thoughts about where we are today, and what we are doing in the future,
>  in two parts: repository and working tree.
>
>  For the repository: it is part of the specification for nested trees
>  that you can have nested trees while still having separate history
>  databases for each tree. This lets you partition the IO workload
>  required by your servers. Each repository has a set of read-only data
>  files (packs); commits create new files, and from time to time existing
>  files are combined to prevent huge seek activity when accessing your
>  database. The combining operation gives increased locality of reference
>  for related data and helps with scaling up.
>
>  Our indices follow a similar scheme, with each index being keyed to a
>  specific pack file. Currently we buffer in memory the region of each
>  index that is accessed during an operation, for performance. In
>  extremely large databases this could lead to memory pressure - but we
>  can reduce or eliminate that buffering. We already have plans to tweak
>  the index format to reduce the need for buffering.
>
>  Updates and checkouts only perform reads on the central repository, and
>  the same disk blocks will be read for developers working on the same
>  region of your project, allowing OS disk cache hits. There are
>  developers working on better delta logic too, which will approximately
>  halve the size of your historical database compared with the current
>  bzr storage.
>
>  For the working tree, the checkout on disk that you commit to, the
>  primary indicator for performance is not the number of bytes in the
>  tree, but rather the total number of paths, and the number of lines in
>  any modified file. Tagging any number of subtrees - that could mean two
>  things. It could mean a commit in the top level followed by making a new
>  branch, or it could just mean making a new branch from a previously
>  known point in time. For the former, bzr will first perform essentially
>  'bzr status' on each tree to detect changes, and then record a new
>  inventory for the top level tree, which should be a fairly small tree
>  from the sound of it. Making a new branch in bzr requires writing a few
>  K of data when a shared repository is in use (and you'll likely want
>  one :)), so should be nearly instant.
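
The shared-repository point above can be sketched roughly as follows.
This is only an illustration: the paths and branch names are
placeholders, branch_in_shared_repo is a hypothetical helper, and the
use of '--no-trees' (so that new branches do not also get a working
tree written out) is an assumption about how the server side would be
laid out.

```shell
# BZR can be overridden; BZR="echo bzr" prints the commands instead of running them.
BZR=${BZR:-bzr}

# Hypothetical helper.
#   $1 = directory for a new shared repository
#   $2 = existing branch to bring into it
#   $3 = name of the new branch to create
branch_in_shared_repo() {
    # Create the shared repository, without working trees for its branches.
    $BZR init-repo --no-trees "$1" || return 1
    # The first branch copies the history into the shared repository.
    $BZR branch "$2" "$1/trunk" || return 1
    # Later branches reuse the history already stored in the repository.
    $BZR branch "$1/trunk" "$1/$3"
}

# Example (placeholder paths):
#   branch_in_shared_repo /srv/bzr/project /home/me/project release-1
```

To branch from a previously known point rather than the tip, 'bzr
branch' also accepts '-r' with a revision.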
>
>
>
>  > ** Daily build and acceptance work-flow description
>  >
>  >    Developers check out their private checkout, and use it for
>  >    development.
>  >
>  >    At the end of the day a program starts an update and a build. In
>  >    the morning, the test team checks the build results and, later in
>  >    the day, gives the go-ahead that the particular revision is all
>  >    right. However, as part of the build,
>  >    there may be libraries and executables taken in, which need to be
>  >    committed. Does this mean that prior to daily acceptance no one may
>  >    commit, or is there another solution possible with bzr ?
>
>  Branches are a very useful thing to represent concurrency. I would say
>  here that your acceptance tool should have its own branch. When it runs
>  it will:
>   - pull --overwrite the mainline into its branch
>   - test
>   - commit the tree if changes were made
>
>  Then, whoever is responsible for checking acceptance does:
>   - merge from the acceptance tool's branch
>   - commit
>
>  And at this point developers will get the acceptance tool's changes and
>  libraries when they next update from the mainline.
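
The two lists of steps above could be scripted along these lines. This
is a sketch under stated assumptions, not a tested setup: the function
names, paths, commit messages and the build/test command are all
placeholders, and the 'bzr add' step is an addition that assumes any
new libraries or executables produced by the build should be versioned
(with unwanted files already covered by ignore rules).

```shell
# BZR can be overridden; BZR="echo bzr" prints the commands instead of running them.
BZR=${BZR:-bzr}

# Run by the acceptance tool (hypothetical helper).
#   $1 = the tool's own branch directory
#   $2 = mainline branch location
#   $3 = build/test command
acceptance_run() (
    cd "$1" || exit 1
    # Make the tool's branch match the mainline, discarding local divergence.
    $BZR pull --overwrite "$2" || exit 1
    # Build and test; stop here if that fails.
    sh -c "$3" || exit 1
    # Assumption: version any new build products.
    $BZR add || exit 1
    # 'bzr status' prints nothing for an unchanged tree, so commit only
    # when it reports something.
    if [ -n "$($BZR status)" ]; then
        $BZR commit -m "Build products from acceptance run"
    fi
)

# Run by whoever signs off the build (hypothetical helper).
#   $1 = a mainline working tree, $2 = the acceptance tool's branch
promote_acceptance() (
    cd "$1" || exit 1
    $BZR merge "$2" && $BZR commit -m "Merge accepted build products"
)
```

Because the tool commits to its own branch, developers can keep
committing to the mainline while the build and acceptance check are in
progress.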
>
>
>
>  >    After daily acceptance, it should be possible to say to the
>  >    developers to which revision they may update their checkout.
>  >
>  >    Developers are responsible for merging their work against the
>  >    latest accepted (promoted) status.
>
>  > ** Multi-site work flow description
>  >    Daily check-ins are not sent to each repository. Rather, all
>  >    projects get assigned responsibility to one subsystem, and these
>  >    are developed in the context of other subsystems. There is a weekly
>  >    release. This weekly release should be taken into the other
>  >    repositories. The biggest problem with this part is the deletion of
>  >    objects.
>  >
>  >   I suppose that in the case of daily-build and multi-site work flow,
>  >   the solutions lie in the distributed development model of
>  >   bzr. In this case, I should probably first do a comparison of terms.
>  >
>  > | Proprietary | bzr                   |
>  > |-------------+-----------------------|
>  > | database    | repository            |
>  > | project     | light-weight checkout |
>  > | version     | revision              |
>  >
>  >   The other system has an export mechanism which makes it possible to
>  >   pack a certain revision of a project into a package which can be
>  >   transferred to another database and recreated there. This mechanism
>  >   can be used differentially, in which case it is possible to send
>  >   only deltas between the originating and the receiving databases. I
>  >   suppose the only available mechanism in lieu of this is probably
>  >   diff and patch.
>
>  Why not push and pull ?
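
For the weekly multi-site release, push and pull might look roughly
like this. Both commands transfer only the revisions the other side is
missing, which covers the differential case. The helper names and the
URL are placeholders, and the sketch assumes each receiving site keeps
an unmodified mirror of the release branch (a mirror that has diverged
would need 'bzr merge' instead of 'bzr pull').

```shell
# BZR can be overridden; BZR="echo bzr" prints the commands instead of running them.
BZR=${BZR:-bzr}

# At the site that owns the subsystem, from inside its release branch:
# publish the branch to a location the other sites can reach.
#   $1 = published location
publish_release() {
    $BZR push "$1"
}

# At a receiving site, from inside its mirror of the release branch:
# fetch whatever revisions have been published since the last pull.
#   $1 = published location
fetch_release() {
    $BZR pull "$1"
}

# Example (placeholder URL):
#   publish_release sftp://site-a.example.com/srv/bzr/subsystem/release
#   fetch_release   sftp://site-a.example.com/srv/bzr/subsystem/release
```
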
>
>
>  >   I should be running checks to see which transport mechanism is the
>  >   fastest. I have already established that the file:// protocol over
>  >   a CIFS share is very slow.
>
>  What latency do you have involved between your workstation and the CIFS
>  share?
>  How many files in the checkout?
>  How many revisions?
>  (bzr info -v will answer some of this).
>
>  You may be running into scaling issues with your tree size - in which
>  case I'd be delighted to help you track them down sufficiently that I can
>  file a bug report (which can then be fixed).
>
>
>  > ** The way access rights have to be assigned to a repository is not
>  >    clear
>  >    Especially since I am now doing experiments in a Cygwin environment,
>  >    I twice had problems with locks which could not be removed from the
>  >    repository after doing a checkout.
>
>  Cygwin is, in general, very slow. (I spent some time as a Cygwin
>  developer - this is not a criticism). I strongly recommend using the
>  native Python version of bzr as we have done the porting work in bzr,
>  which means less overhead and better performance.
>
>
>  --
>  GPG key available at: <http://www.robertcollins.net/keys.txt>.