Plans, please, not excuses [was: Primer ... ?]

Robert Collins robertc at robertcollins.net
Tue Mar 25 23:03:07 GMT 2008


On Wed, 2008-03-26 at 04:21 +0900, Stephen J. Turnbull wrote:
> 
> Now, when Robert Collins says that the "pack" repo format is a good
> foundation for performance and scaling work, that's good enough for
> me.  But maybe it would be a good idea (from a PR standpoint) for
> somebody to explain why that is so and what changes need to be made to
> what bzr commands.  Otherwise, I don't see why anybody should believe
> it until they see it.  Hopefully that would wet down some of the
> flames on emacs-devel....

So, I *hope* that what Ian meant by starting late was that we were late
to treat performance as a key criterion when evaluating changes. Clearly
bzr predates git and hg by some months, as you say.

What we need to do to fix the performance of various bzr commands can be
broken sensibly into two parts, I think:

The first part is that operations like 'merge' perform better when they
use methods on Repository that request a set of related data, rather
than repeatedly asking for one item at a time - this is because we can
more intelligently plan the IO we perform. Aaron has some fantastic
performance improvements done using this approach, which we hope to land
soon. So for a given command that is slow because it performs many round
trips when fewer would do, we need to switch over to better method
calls.
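
A minimal sketch of the difference, in Python. CountingStore, get_text
and get_texts are names made up for illustration; they are not the
bzrlib Repository API, and the "round trip" counter only stands in for
real network or disk latency:

```python
class CountingStore:
    """Hypothetical stand-in for a repository behind a high-latency link."""

    def __init__(self, texts):
        self._texts = dict(texts)
        self.round_trips = 0

    def get_text(self, key):
        # One request, one round trip, one item.
        self.round_trips += 1
        return self._texts[key]

    def get_texts(self, keys):
        # One request for the whole set: the store sees every key up
        # front, so it could order and combine its reads.
        self.round_trips += 1
        return {key: self._texts[key] for key in keys}


def fetch_one_by_one(store, keys):
    """Ask repeatedly: cost grows with the number of keys."""
    return {key: store.get_text(key) for key in keys}


def fetch_batched(store, keys):
    """Ask once for the related set: a single round trip."""
    return store.get_texts(keys)
```

Both functions return the same data; only the number of round trips
differs, and that is the cost that dominates over a slow link.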

The second part is more tricky but equally important. While we were
focused solely on getting a really smooth UI, and largely ignoring
performance, we ended up writing a lot of O(history) and O(files) code.
This obviously scales terribly, and I really have no explanation for why
I considered it ok at the time. This code needs to be rewritten to only
consider locally relevant data. One example, which has been
very visible in the emacs discussion, is the time for 'bzr log' to start
outputting data.
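
To illustrate the 'bzr log' case, here is a rough sketch of the two
shapes of code. It is not how bzr implements log; CountingGraph and the
function names are hypothetical, and only the first-parent line of
history is walked to keep the example small:

```python
from itertools import islice


class CountingGraph:
    """Hypothetical revision graph that counts parent lookups."""

    def __init__(self, parent_map):
        self._parent_map = parent_map
        self.lookups = 0

    def get_parents(self, revision_id):
        self.lookups += 1
        return self._parent_map.get(revision_id, ())


def iter_mainline(graph, tip):
    """Yield the tip, then each first parent, one lookup per step."""
    revision_id = tip
    while revision_id is not None:
        yield revision_id
        parents = graph.get_parents(revision_id)
        revision_id = parents[0] if parents else None


def load_whole_mainline(graph, tip):
    """O(history): nothing is returned until the oldest revision is reached."""
    return list(iter_mainline(graph, tip))


def first_screenful(graph, tip, count):
    """Only the newest `count` revisions are visited before output starts."""
    return list(islice(iter_mainline(graph, tip), count))
```

With a long history, first_screenful does work proportional to what is
displayed, while load_whole_mainline does work proportional to the size
of the history before showing anything.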


The pack format provides us with a storage layer that:
- only rewrites a single file - the root index file; no other files 
   are rewritten or appended to in the repository
- has ~log10 file count growth with number of revisions in the store
- has no file count growth with the number of versioned files, or
   versions of the same file
- has partially readable indices (using bisection to locate keys)
- supports multiple writers
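
As an illustration of what "partially readable" buys, here is a small
sketch of locating a key by bisection. It assumes an index whose
records are sorted by key and can be read by position; SortedIndex and
lookup are hypothetical names, and this is not the on-disk pack index
format:

```python
class SortedIndex:
    """Hypothetical index: sorted (key, value) records, read by position."""

    def __init__(self, items):
        self._records = sorted(items)
        self.reads = 0

    def __len__(self):
        return len(self._records)

    def read_record(self, position):
        # Each call stands in for reading one small region of the file.
        self.reads += 1
        return self._records[position]


def lookup(index, wanted):
    """Binary search that reads only the records it probes."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        key, value = index.read_record(mid)
        if key == wanted:
            return value
        if key < wanted:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Finding one key among 100,000 records takes at most 17 reads here,
instead of loading and parsing the whole index.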

So unlike any of the preceding bzr disk layouts, we have two critical
facilities for fixing our scaling-related performance problems:
- sublinear file count growth in the database as additional files are
  versioned, or commits created. This matters because performance over
  dumb protocols (including local disk) is extremely sensitive to
  latency.
- we can work with some portion of the revision graph (or indeed file
  graph) without paying a linear scaling cost with the size of the 
  repository. 
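
To make the file count claim concrete, here is one simple combining
policy that gives roughly log10 growth. It is a sketch under an
assumption, namely that every commit writes one new pack and that ten
packs of the same size are merged into one; the policy bzr actually
uses may differ in detail, and add_commit is a made-up name:

```python
def add_commit(packs):
    """Return the pack sizes after one more commit.

    `packs` is a list with the number of revisions held by each pack file.
    """
    packs = packs + [1]  # the new commit lands in its own small pack
    size = 1
    # Whenever ten packs of one size exist, combine them into a single
    # pack ten times larger, then check the next size up.
    while packs.count(size) >= 10:
        for _ in range(10):
            packs.remove(size)
        packs.append(size * 10)
        size *= 10
    return packs


def packs_after(commits):
    """Replay `commits` commits from an empty repository."""
    packs = []
    for _ in range(commits):
        packs = add_commit(packs)
    return packs
```

Under this policy the number of pack files equals the sum of the
decimal digits of the revision count, so it never exceeds nine files
per decimal digit, however many files are versioned.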

Which is why packs are a good basis. As for specific plans, I would say
we as a group know how to fix performance for most/any given command,
and which ones we fix first will likely be driven by which
[prospective] users complain most.


-Rob



-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.