Bzr and large repositories
John Arbash Meinel
john at arbash-meinel.com
Thu Oct 26 14:15:07 BST 2006
Nicholas Allen wrote:
>
>
> Martin Pool wrote:
>> On 26 Oct 2006, Nicholas Allen <nick.allen at onlinehome.de> wrote:
>>> Hi,
>>>
>>> I remember reading a while ago that bzr used huge amounts of memory
>>> on large repositories. Is this still the case? We have a very large
>>> svn repository (about 40,000 revisions and many Gb in size). Do you
>>> think this will be hard for bzr to handle?
>>
>> That would be hard for bzr 0.12. We're working on improving Bazaar to
>> scale up to repositories of that size or larger.
>
> What do you think are the bottlenecks at the moment? eg. is it in
> updating the working tree after a pull or is it branching from a large
> repository that would be the problem?
The biggest bottleneck at the moment is the inventory file (the
equivalent of a 'manifest' in other VCSes). If your repository can be
broken up into logical sub-projects, I don't think it would be a big issue.
In brief, the problems with the inventory are:
1) With 100K entries it gets pretty big (around 200+ bytes per entry),
so 100K entries comes to about 20MB.
2) Our default algorithm for knits saves a full text every 26
revisions, which is sub-optimal for inventories because they are very
large and on average have very small changes each time. It would be
*very* easy to tweak this to a different number, and only slightly
harder to change it so that it does something more like "keep a new
full text only when the size of all deltas == the size of a full text".
3) We are looking into splitting up the inventory into per-directory
inventories, so that you never have one big file that always changes.
'git' uses this method, though the 'hg' developers claim they get much
better performance from an all-in-one file, because they don't have to
go around and update N files.
4) bzr isn't 100% optimal in how it reads inventories and manages them
in memory. It builds them into an object representation, which turns
out to be a lot more expensive than just keeping them as strings,
tuples, lists, and dictionaries. (In a quick test, it costs 0.3us to
create a tuple, 1us to create a list, 8.5us to create an object, and
7.8us to create an object using __slots__.) So switching to tuples
where possible can give a 20-30x performance improvement when you have
lots of them to create.
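Those micro-costs are easy to reproduce with `timeit`. A minimal sketch
(the `Entry` class and its field names are made up for illustration,
not bzr's real inventory objects):

```python
import timeit

# Hypothetical inventory-entry shapes, for illustration only.
class Entry:
    def __init__(self, file_id, name, parent_id):
        self.file_id = file_id
        self.name = name
        self.parent_id = parent_id

class SlotsEntry:
    __slots__ = ('file_id', 'name', 'parent_id')
    def __init__(self, file_id, name, parent_id):
        self.file_id = file_id
        self.name = name
        self.parent_id = parent_id

# Use variables so the tuple isn't constant-folded by the compiler.
fid, name, pid = 'file-id', 'name', 'parent-id'
n = 100_000
t_tuple = timeit.timeit(lambda: (fid, name, pid), number=n)
t_list = timeit.timeit(lambda: [fid, name, pid], number=n)
t_obj = timeit.timeit(lambda: Entry(fid, name, pid), number=n)
t_slots = timeit.timeit(lambda: SlotsEntry(fid, name, pid), number=n)

print("tuple %.2fus  list %.2fus  object %.2fus  slots %.2fus" %
      tuple(t * 1e6 / n for t in (t_tuple, t_list, t_obj, t_slots)))
```

The absolute numbers vary by machine and Python version, but tuple
creation reliably beats calling a class's __init__ by a wide margin,
which is the whole point of keeping inventories as plain tuples.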
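And the "cut over to a full text once the deltas add up to one" idea
from point 2 could be sketched like this (a hypothetical helper, not
bzr's actual knit code):

```python
def should_store_fulltext(delta_sizes, fulltext_size):
    """Decide whether the next revision should be stored as a full text.

    Cut over once the accumulated deltas since the last full text are
    at least as large as a new full text would be.

    delta_sizes: sizes in bytes of the deltas since the last full text.
    fulltext_size: size in bytes a new full text would occupy.
    """
    return sum(delta_sizes) >= fulltext_size

# A ~20MB inventory changing ~200 bytes per commit would keep storing
# deltas far beyond the fixed every-26-revisions cutoff:
print(should_store_fulltext([200] * 26, 20_000_000))  # -> False
```

For a file like the inventory, where each delta is tiny relative to the
full text, a size-based policy stores far fewer redundant full texts
than a fixed revision count.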
Also, there is a little more information that should be stored in an
inventory. Specifically, we want to store either the sha hash, or at
least the last-changed revision, of all children of a directory, so
that when comparing 2 inventories we can look at the top-level
directory(ies) and know whether we have to compare anything underneath,
or whether that whole part of the tree has not changed.
It is a pretty simple data change, which should help a lot when
comparing large trees for changes.
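As a sketch of that short-circuit (the nested-dict tree shape here is
made up for illustration; bzr's real inventories are not represented
this way):

```python
def changed_entries(a, b, path=""):
    """Compare two inventory trees, skipping any subtree whose recorded
    last-changed revision is identical on both sides.

    Each tree is a dict: name -> {"rev": str, "children": dict (dirs)}.
    """
    changed = []
    for name in sorted(set(a) | set(b)):
        ea, eb = a.get(name), b.get(name)
        child_path = path + "/" + name
        if ea is None or eb is None:
            changed.append(child_path)   # added or removed
        elif ea["rev"] == eb["rev"]:
            continue                     # whole subtree unchanged: skip it
        else:
            changed.append(child_path)
            changed.extend(changed_entries(ea.get("children", {}),
                                           eb.get("children", {}),
                                           child_path))
    return changed

old = {"src": {"rev": "r1", "children": {"a.c": {"rev": "r1"}}},
       "doc": {"rev": "r1", "children": {"x.txt": {"rev": "r1"}}}}
new = {"src": {"rev": "r2", "children": {"a.c": {"rev": "r2"}}},
       "doc": {"rev": "r1", "children": {"x.txt": {"rev": "r1"}}}}
print(changed_entries(old, new))  # -> ['/src', '/src/a.c']
```

The win is that 'doc' is never descended into: for a tree with 20K
directories where only one subtree changed, most of the comparison work
disappears.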
>
>>
>> If you can say more about the character of the repository it would be
>> interesting, such as
>> how many files and directories are there in one version of the source
>> tree?
>
> We have around 100,000 files of which about 20,000 are directories.
>
If you need all of that in a single versioned tree (rather than
splitting it up by project), then it is probably outside of what would
be comfortable in bzr today. (The freebsd ports tree has >100K files
and 160K revisions, and it is a little too big for us right now.)
>>
>> what's the distribution of individual file sizes - average and
>> maximum?
>
> Average size is probably around 30kb.
>
>>
>> how many branches, developers, and commits per day?
>
> 15 developers and maybe 20-30 commits per day.
I don't think this is a specific problem for bzr, especially as
everyone can have their own branch to commit on. I personally commit as
much as 20-30 times per day on the bzr code base :) Okay, it may be
more like 10 or so, but I try hard to follow the 'every commit is a
small logical change' philosophy, so my commit rate is quite high.
>
>>
>> how does Subversion do with this load?
>>
>
> We have no problems with svn at all under this load.
>
Out of curiosity, are you using the fsfs storage or Berkeley DB (bdb)?
I was talking with an SVN dev the other day, and they mentioned scaling
problems with fsfs. (Basically, how they store indexes is poor, both
because the index is put at the end of the file and because there
aren't good external indexes, so they end up having to seek through
lots of files, and seek around inside files a lot more than they
should.) But I'm not sure at what scale, in O() terms, those problems
kick in.
John
=:->