Bazaar repository size benchmarks

Ian Clatworthy ian.clatworthy at internode.on.net
Tue Jun 3 12:56:45 BST 2008


Pieter de Bie wrote:
> On Mon, Jun 2, 2008 at 1:40 AM, Ian Clatworthy

> I will take a look at what I can put online; the repositories are
> quite big and I don't have a fast upload. Perhaps I can upload some of
> the smaller repos (that is, not the mozilla one)?

Thanks.

> Do you mean you use hard links for the working tree? Won't that edit
> two repositories if you use an editor / command that edits files (as
> opposed to delete/create?)

To use this feature safely, you need an editor that breaks hard links on
save. Or a more manual process, of course.
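
For anyone scripting edits rather than relying on an editor, here is a
minimal sketch of a save that breaks hard links. It is not part of bzr; the
function name save_breaking_hardlinks is made up for illustration. It writes
to a new file and renames it over the old name, so any other name linked to
the old file keeps the previous content.

```python
import os
import tempfile


def save_breaking_hardlinks(path, data):
    """Write data to path via a fresh file, leaving other hard links untouched.

    Hypothetical helper for illustration only. Note that the new file gets
    mkstemp's default permissions rather than those of the file it replaces.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
        # Swap the directory entry to point at the new file. The old file,
        # and every other name hard-linked to it, keeps its old content.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Opening the existing file and writing into it in place would instead change
the content seen through every linked name, which is the problem raised in
the question above.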

>> I'd like to analyse where the space is being used in the repositories you've
>> generated. I *think* it's probably in inventories but I'd like to confirm that.
>> To give an idea of the difference possible, on one test repository with 6000+
>> revisions (wordpress), I'm seeing a variation from 17MB to 78MB depending on
>> how often inventory fulltexts are stored. The current fastimport algorithm -
>> create an inventory fulltext every 200 and only every 200 - gives a repository
>> size of 17.5MB so it's acceptable on my test repository but could well be
>> lousy on other data sets.
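
To make the fulltext-frequency trade-off in the quoted paragraph concrete,
here is a toy model, not the fastimport code. It assumes every inventory
fulltext costs the same number of bytes and every delta costs the same
smaller number; both sizes are hypothetical inputs, and compression and
indexing overhead are ignored.

```python
def estimated_inventory_bytes(revisions, fulltext_bytes, delta_bytes, interval):
    """Toy estimate: one fulltext every `interval` revisions, deltas otherwise.

    All sizes are hypothetical inputs, not measurements from a real repository.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    # Revisions 0, interval, 2*interval, ... are stored as fulltexts.
    fulltexts = (revisions + interval - 1) // interval
    deltas = revisions - fulltexts
    return fulltexts * fulltext_bytes + deltas * delta_bytes


# Compare a few intervals for an imaginary 6000-revision history.
for interval in (1, 50, 200):
    size = estimated_inventory_bytes(6000, fulltext_bytes=100_000,
                                     delta_bytes=500, interval=interval)
    print(interval, size)
```

In this model a larger interval always gives a smaller total. The cost it
leaves out is that reading a revision means applying every delta since the
last fulltext, so longer chains are slower to read.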

Once again, thanks for doing this benchmarking. I had been planning to
do some myself soon, so this is a great start. Thanks also for the tweaks
to fastimport and fastexport. I've incorporated one already and I'll take
a look at the others when I get a moment.

In general, we do have plans to improve our storage efficiency on
deep histories - it just isn't top of the list right now. There are at
least two areas we know we can improve a lot - diffs of binary files
and inventory storage.

I'm sure your benchmarking was sound but I'm curious about a few of
the results. The Mozilla benchmark doesn't gel with the figures given
in http://www.infoq.com/articles/dvcs-guide. Any ideas why? I'd also
like to know more about the heritage of the Emacs repo. Was it
converted by bzr-svn or by fastimport? The former uses a different
scheme for revision-ids. I wonder whether running it through
bzr-fastexport/bzr-fastimport would change the size.

FWIW, I tweaked bzr-fastimport yesterday to have a new parameter that
controls how often an inventory fulltext is stored. That might prove
useful on repositories with deep histories. I also noticed in my
experiments that bzr-fastimport can give repos of quite different
sizes for *some* front-ends depending on how it's run. In particular:

  bzr fast-import ../wordpress.fi -> 17.5MB
  bzr fast-import ../wordpress.fi --info ../wordpress.cfg -> 20.5MB

That's *bad* and indicates a bug, possibly in the blob caching?
I have no idea whether that bug is impacting your results or not.
I'm really busy for the next week or two so I honestly can't look
into this right now. I do promise to come back to fastimport and
space efficiency in general once other priorities are addressed.
If anyone wants to jump in and investigate more before then,
please go ahead.
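
For anyone reproducing the size comparison above, here is a minimal sketch
of one way to total the on-disk size of each import. The function name
directory_size_bytes and the idea of pointing it at each repository's .bzr
directory are my assumptions, not something bzr provides.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all files under root, skipping symbolic links.

    Hypothetical helper for illustration. Files reachable through several
    hard-linked names are counted once per name.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

Calling it on the .bzr directory produced by each of the two invocations
and comparing the totals per subdirectory should show where the extra 3MB
ends up.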

Ian C.


