Creating a roadmap for improving bzr's performance.

Thu May 10 08:29:36 BST 2007

On 5/7/07, Robert Collins <robertc at robertcollins.net> wrote:

> What should the final system look like, how is it different to what we
> have today?
> -----------
>
> One of the things I like the most about bzr is its rich library API, and
> I've heard this from numerous other folk. So anything that will remove
> that should be considered a last resort.
>
> Similarly our relatively excellent cross platform support is critical
> for projects that are themselves cross platform, and that's a
> considerable number these days.
>
> And of course, our focus on doing the right thing is what differentiates
> us from some of the other VCS's, so we should be focusing on doing the
> right thing quickly :).
>
> What we have today though has grown organically in response to us
> identifying bottlenecks over several iterations of back end storage,
> branch metadata and the local tree representation. I think we are
> largely past that and able to describe the ideal characteristics of the
> major actors in the system - primarily Tree, Branch, Repository - based
> on what we have learnt.

Yes, "just what is Bazaar?" is a question that came up in my mind
while thinking through some of the use cases.  We want to be a system
that's fast and pleasant to use on trees of any size, and I think to
get there we can't just change minor internals: the UI, formats, and
architectural concepts may need to change too.

One case in point is commit's behaviour of listing
add/move/delete/modify records as it commits, which requires doing a
tree comparison that's not strictly necessary for commit.  This is a
ui nicety, and a nice ui is important to us, but not at any price.
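
As a rough illustration of that trade-off, here is a minimal sketch of a
commit that only pays for the tree comparison when the caller asks for
the report.  This is not bzrlib code: `commit`, `compare_trees`, the
`reporter` callback and the dict-based trees are all hypothetical names
made up for the example.

```python
# Hypothetical sketch, not the bzrlib API: trees are modelled as plain
# dicts mapping path -> file content.

def compare_trees(old, new):
    """Classify paths as added, deleted or modified (renames are ignored)."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return {"added": added, "deleted": deleted, "modified": modified}


def commit(history, working_tree, reporter=None):
    """Append a snapshot of working_tree to history.

    The comparison against the previous snapshot is only computed when a
    reporter callback is supplied, so a quiet commit skips it entirely.
    """
    snapshot = dict(working_tree)
    if reporter is not None:
        previous = history[-1] if history else {}
        reporter(compare_trees(previous, snapshot))
    history.append(snapshot)
    return len(history)  # revision number of the new snapshot


history = []
commit(history, {"README": "hello"})  # quiet: no comparison is done
reports = []
commit(history, {"README": "hello, world", "setup.py": ""},
       reporter=reports.append)
```

Whether the real commit code could be split along these lines depends on
what else it needs from the comparison; the sketch only shows the shape
of making the report optional.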

While anything is potentially on the table, here are the things that
are most important to me (recapping some from Robert), in no particular
order:

 * good library interface; approachable codebase
 * flexibility to vary these layers and do bzr-svn, new formats or
   similar hacks -- I love that you can add a new repository and test
   whether it fulfils that interface
 * safety
 * testing and testability
 * cross-platform support
 * being understandable/transparent - when something does go wrong,
   letting people recover and/or make a good bug report
 * storage makes minimal assumptions about the filesystem, enough that
   it works on NTFS or over FTP, SFTP or NFS -- for example, wanting to
   mmap files would go against this, and using OS locks on dirstate
   has already stretched it
 * read-only operations can be done on physically read-only transports
 * simple operations are simple: init, add, commit stay like that --
   init-repo is already stretching this a bit by making people think
   about storage optimization
 * supporting Ubuntu developers - quickly getting trees they need to
   work on, allowing for good imports from CVS etc, allowing space to
   link to bugs
 * recording and using file and directory renames
 * tracking directories (incl empty), symlinks, etc
 * not refusing to do anything the user might reasonably expect us to
   do -- unlike the file-id or tree-cleanliness errors one would get
   from arch
 * the basic design is a DAG of revisions, each referring to the state
   of the tree at a point in time
 * biased towards source-like trees, but able to handle arbitrary data
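
The "DAG of revisions" item can be pictured with a small sketch.  It is
an illustration only, not how bzrlib stores anything: the `Revision`
class, the `ancestry` helper and the dict-based tree snapshots are
hypothetical.

```python
# Hypothetical sketch, not bzrlib code: each revision names its parents
# and carries a snapshot of the tree (path -> content) at that point.

class Revision:
    def __init__(self, revision_id, parents, tree):
        self.revision_id = revision_id
        # No parents for the first commit, two or more for a merge.
        self.parents = list(parents)
        self.tree = dict(tree)


def ancestry(revisions, tip_id):
    """Return the set of revision ids reachable from tip_id, itself included."""
    seen = set()
    pending = [tip_id]
    while pending:
        rev_id = pending.pop()
        if rev_id in seen:
            continue
        seen.add(rev_id)
        pending.extend(revisions[rev_id].parents)
    return seen


revisions = {
    "a": Revision("a", [], {"README": "v1"}),
    "b": Revision("b", ["a"], {"README": "v2"}),
    "c": Revision("c", ["a"], {"README": "v1", "NEWS": "x"}),
    # "d" merges the two lines of development started at "a".
    "d": Revision("d", ["b", "c"], {"README": "v2", "NEWS": "x"}),
}
```

Here `ancestry(revisions, "d")` walks both parents of the merge and
visits the shared ancestor "a" only once, which is the property that
makes the history a graph rather than a list.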

(In this thread I'm only mentioning things that might trade off
against speed or influence how we get there, but that doesn't mean I
don't appreciate other qualities.)

And here are some things that are well established in Bazaar, but that
I would say are very much up for consideration:

 * one VersionedFile per file id, both as an object in memory and on disk
 * an Inventory object distinct from Tree
 * caching line-by-line annotations at commit time
 * recording per-file graphs - this is largely redundant with the
   whole-tree graph, and I think only used for per-file log?
 * using file ids for most internal APIs - they're the easiest way to
   make sure we follow file identity across revisions, but for e.g.
   working tree operations it's perverse to go to ids and back
 * naming file versions by random ids rather than hashes
-- 
Martin