Work flow on large repositories

Wed Jul 28 21:17:18 BST 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Hope wrote:
> Hi there.  I'm working on the gcc-linaro branch which is stored under
> bzr and hosted on Launchpad.  This is a fairly big branch as it was
> imported from upstream SVN and contains a large amount of history.
> 
> Most of the work is day-to-day changes on topic branches.  I also want
> to run a buildbot style program that continually updates and builds
> the latest.

This is my general setup for using a small number of working trees and
*lots* of feature branches. I also find that it is very useful to have a
pristine version of trunk checked out, so that I can reference it versus
what I'm currently working on.

bzr init-repo --no-trees project
cd project
# This both fetches the history data, *and* makes a working copy, *and*
# binds it to $SOURCE so that I know it will stay in sync. (no chance to
# commit in what should be my pristine trunk branch)
# Also, if the project doesn't use something like PQM, you can 'land'
# changes directly to trunk via this checkout.
bzr co $SOURCE trunk

# And this sets up a place where I can do work
bzr co --lightweight trunk work
# At this point, what you really want is a feature branch
cd work
bzr branch --switch ../trunk ../feature_branch

# Now you can easily create feature branches. I personally use the
# 'bzr branch --switch ../trunk' notation because I want almost all of my
# branches based off of trunk, rather than whatever feature I'm currently
# working on. Some people would prefer 'bzr switch -b ../new-feature',
# which creates a new feature branch based off the current one.

# Also note that 'bzr branch --switch' works well with remote branches.
# It lets me grab someone else's code, bring it to a local branch, and
# update the local working tree at the same time.
bzr branch --switch lp:~otheruser/project/feature ../otheruser-feature

# I also usually have at least 2 'working' areas
# Mostly because of our review system. So I'm often working on the next
# feature, when I might need to go back and clean up an old one. And I
# want to do so without disturbing my current work.
bzr co --lightweight ../feature_branch ../alt_work

# Note also that by having a small number of working dirs, your rebuild
# times, etc., should be faster, since the build only has to redo what
# actually changed.
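
The "new feature branch off trunk" step above can be wrapped in a small shell function. This is only a sketch: the name new_feature and the BZR override variable are made up for illustration, and it assumes you run it from inside a lightweight checkout (such as 'work') that sits next to 'trunk' in the shared repository.

```shell
# Hypothetical helper, not part of bzr itself.
# BZR can be overridden (e.g. BZR=echo) to preview the command.
BZR="${BZR:-bzr}"

new_feature() {
    # $1: name of the new branch, created as a sibling of trunk.
    # Branches from ../trunk and switches the current checkout to it.
    "$BZR" branch --switch ../trunk "../$1"
}
```

For example, 'new_feature fix-foo' run inside 'work' would branch trunk into '../fix-foo' and point the working tree at it.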

> 
> My issue is that the various operations are taking too long.  Could
> anyone suggest tricks or a different work flow to speed things up?
> 
> Some of the operations include:
> 
> Creating a mirror branch by doing init-repo, branch lp:gcc-linaro/4.4.
>  The finding revisions stage takes about 10 minutes at 1kB/s.  The
> download stage is much faster.

As mentioned by Andrew, this is a weakness, and something we should
definitely fix in one fashion or another. (Faster discovery, or a
special case for the empty repo.)

> 
> Day-to-day work is done on topic branches.  Creating the branch takes
> 46 s, 250 MB of RAM, and creates a 20 MB .bzr directory.  Pushing this
> branch to LP for merging involves pushing the full 20 MB, but this is
> acceptable.

I can't say I fully understand this. Testing it here:
 $ time bzr branch trunk test -Dmemory
 PeakWorking          25824 KiB
 real    0m0.974s
 $ du -ksh test/.bzr/
 265K    test/.bzr/

So that is 1s, 0.25MB and peak memory of 26MB. Now, you may be creating
topic branches with a working tree, but then this doesn't really line up
with your later comment.
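
For reference, the unit conversions behind those rounded numbers can be checked with a couple of one-liners. This is just a sketch: the function names are invented here, and it assumes the figures reported by -Dmemory and du are in KiB (1024 bytes).

```shell
# Hypothetical helpers for converting KiB figures.
kib_to_mb() {
    # KiB -> decimal megabytes (10^6 bytes)
    LC_ALL=C awk -v k="$1" 'BEGIN { printf "%.2f\n", k * 1024 / 1000000 }'
}

kib_to_mib() {
    # KiB -> binary mebibytes (2^20 bytes)
    LC_ALL=C awk -v k="$1" 'BEGIN { printf "%.2f\n", k / 1024 }'
}

kib_to_mb 25824    # peak memory: prints 26.44
kib_to_mib 265     # size of test/.bzr: prints 0.26
```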

(249K of that 265K is the 'tags' file, which has about 2.6k tags, and a
lot of those are things like:
var-tracking-assignments-merge-148582-after
var-tracking-assignments-merge-148582-before
var-tracking-assignments-merge-148582-trunk
)

Side note: I believe Ian Clatworthy observed that bzr-svn's file-id
layout, etc, are not optimal for bzr's internal heuristics. And that
exporting and fast-importing caused a significant reduction in the
size-on-disk. It may be too late to do anything about that now, though.

> 
> Doing a bzr pull on the 4.4 mirror directory may take more than half an
> hour and more than 500 MB of memory.

Is that with the mirror already up to date? To branch all of
lp:gcc-linaro took me about 65 minutes. To add 4.4 to it took 8.5
minutes and 52MB of content transfer.

Note, however, that it includes 1388 revisions that are in the 4.4
branch but not in the 4.5 branch.

> 
> Doing a bzr checkout takes over 20 minutes and 800 MB of memory on my
> fastest machine.  On my netbook and ARM board this causes significant
> swapping.  I've yet to complete a checkout on either.

With bzr.dev I do see a lot of memory consumed during
'bzr co --lightweight'. In fact, it gets to 400+MB before it even
reaches the 'Build phase', which is a bit surprising and something I'll
try to look at. (I currently have a memory dump that I'll need to
analyze.)

After that point, I saw slow growth up to about 600MB, though mostly it
just hovered there, going up a bit and then dropping again.

It definitely seems a bit slower at the end than at the beginning, and I
wonder whether GC overhead is hurting us here. We probably have some
sort of reference cycle that holds maybe 25MB of RAM. It gets cleaned
out when GC runs, but having it there means we spend more time
*running* GC.

It took about 22 minutes to complete, which is comparable to what you
saw. (that also includes a couple of interruptions to create a dump of
memory consumption.)

> 
> I'd also like to share the mirror with other local machines to skip
> downloading the same 500 MB many times.  Running bzr serve and then
> checking out causes 100 % CPU usage for more than 10 minutes on the
> host.
> 
> These numbers were with 2.2b4.  2.2 is significantly better than 2.1.
> 
> -- Michael

I would expect a local network checkout to be CPU bound rather than
bandwidth/latency bound, though it is a bit of a shame to see it so
heavily so. I have a reasonable feeling for where we are the most CPU
bound, and I do have some ideas for how to make it a little bit better,
but I don't have great answers.

I will say that using a shared repository and a lightweight checkout
that you switch around should help for a lot of your day-to-day
operations, even if it doesn't solve the initial branch issues.
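
As a rough sketch of what "switching around" looks like in practice, assuming the layout from earlier in this mail ('work' is a lightweight checkout next to the branches in the shared repository). The work_on name and the BZR override are made up for illustration, not something bzr provides.

```shell
# Hypothetical helper; run from the top of the shared repository.
# BZR can be overridden (e.g. BZR=echo) to preview the command.
BZR="${BZR:-bzr}"

work_on() {
    # $1: name of a branch that sits next to the 'work' checkout.
    # Runs in a subshell so the caller's directory is left alone.
    ( cd work && "$BZR" switch "../$1" )
}
```

So 'work_on feature_branch' repoints the one working tree at that branch, and only the files that differ between the two branches get rewritten on disk.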

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxQkE0ACgkQJdeBCYSNAAPfKwCgxzbUAMzxkr9+VWFCg+L3eMya
vfgAoJtspNch1llgmwtRcJo4q7Jcia8l
=Apr7
-----END PGP SIGNATURE-----