bzr.dev performance, release performance, constraints and accommodations

Robert Collins robertc at robertcollins.net
Mon Dec 3 20:25:18 GMT 2007


So in a couple of bugs, on IRC, and in email we've recently had a
conversation about the performance of specific commands in bzr.dev.
Martin and I also
spent some time on the phone debugging what the issues were as we saw
them. I thought I'd write up my perspective in a summary so that the
information doesn't decay.

I don't think the individuals putting patches forward etc. did anything
wrong at any point, so I'm going to present this in a somewhat abstract
form. There are also many threads interacting here, which makes it
somewhat hard to get a good narrative. I'll try to tell it as one, but
there is also a bunch of stuff I need to get done today, so there may be
a tradeoff involved with this write-up too :).

With bzr.dev's default format set to packs, developers running bzr.dev
have started dogfooding more widely.

A number of operations have been observed to get slower (e.g. log).

And we're about to release; this is a problem if our users see
performance slow down to unacceptable levels.

But wait, isn't *any* slowdown a problem?

No - and here's the first key point. Consider performance. We have a
classic 'any two' triangle:
 - Memory use
 - Round trips
 - Data transferred

If we want minimal data transfer, we pay with more round trips and a
memory buffer. Want less memory used? Pay with more round trips and
repeated data transfer. Want fewer round trips? Pay with transferring
all your data and buffering everything in memory... which is the choice
we made for knits. In March last year many bzr developers got together
in London, and we created a plan whereby we would try our hardest to
avoid *any* of that triangle's corners. In the corners we break, and we
break hard, as tree size, history depth, or text sizes go up.

At the storage layer knits present a local maximum for specific
operations, even though they are globally pessimistic in many other
cases. They are extraordinarily fast at index operations once loaded.
There is a constraint here: 'any access to a .kndx incurs data transfer
and memory buffering proportional to the size of all history'.
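
To make that constraint concrete, here is a minimal sketch of the
knit-style approach (hypothetical names, not the real bzrlib code): the
entire index file is read and parsed at load time, after which every
lookup is a cheap in-memory hit.

    class BufferedIndex(object):
        """Knit-style index: parse the whole file at construction."""

        def __init__(self, path):
            self._entries = {}
            f = open(path, 'rb')
            try:
                # One IO call transfers *all* history data, and we keep
                # all of it in memory for the life of the object.
                # (Assumes simple 'key value' lines for the sketch.)
                for line in f.read().splitlines():
                    key, value = line.split(None, 1)
                    self._entries[key] = value
            finally:
                f.close()

        def get(self, key):
            # Extraordinarily fast once loaded: a plain dict lookup.
            return self._entries[key]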

As a result of this constraint (read all data, be fast at index
operations afterwards) many parts of bzrlib have been skewed - they have
accommodated it - they have started making use of the extremely fast
index access to avoid the overhead of accessing only the needed data,
because that overhead is *dead weight* once you've paid the
full-history price that knits impose.

And here's why *any* slowdown is not necessarily bad. Some things
cannot ever be as fast in an alternative design as they are on
moderate-sized knit repositories: knits have the history graph in a
single file, and always do a single IO call to process it. Packs have N
history graph indices, so *when* access to all history is needed, they
at best pay N times the overhead to achieve the same result.

We have to assess the impact of the format change in terms of what we'd
like for performance, not just the relative change from knits. (Of
course, a slowdown is not /desirable/, it's just in some cases going to
be inevitable.)

Based on the March plan, APIs that require:
 - holding all data in memory
 - processing all history
 - processing entire tree data

are considered evil; they are things we want to avoid. APIs that do
this prevent us from improving things, and they put evolutionary
pressure on our code to *stay the same*, because those APIs are geared
to particular corners of the performance triangle.

Now, with the defaults switched to packs, places that scale with the
size of history are suddenly slower. They are slower because the current
pack index is substantially slower at accessing all of history. This is
partly a bug (the bisection code is improvable), and partly
deliberate - packs do not require buffering all history at load time;
instead they allow partial index data access, and are significantly
*faster* than knits at accessing part of the history... and our desire
for scalable performance is that we only ever access part of history,
except when doing an operation that the user can understand as requiring
all history (e.g. making a brand new branch).
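
As a rough sketch of that partial access (not the actual GraphIndex
bisection code, and assuming sorted, fixed-width records purely to keep
the example short), each lookup costs a handful of seeks and small
reads instead of one big read of everything:

    import os

    RECORD_WIDTH = 64  # hypothetical fixed record size

    def lookup(f, key):
        # Bisect over a sorted on-disk index: O(log n) small reads per
        # lookup, so touching *part* of history is cheap, while
        # touching *all* of it costs many such probes.
        f.seek(0, os.SEEK_END)
        lo, hi = 0, f.tell() // RECORD_WIDTH
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_WIDTH)
            record = f.read(RECORD_WIDTH)
            record_key = record.split(None, 1)[0]
            if record_key < key:
                lo = mid + 1
            elif record_key > key:
                hi = mid
            else:
                return record
        return None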


So, what's this all about? Well, we, as developers, are now seeing X
places get slower. And a casual look with a profiler at a slow operation
on bzr.dev will typically show something like:
16K calls to bzrlib.index.GraphIndex.iter_entries()

That is 16 thousand calls to access history data. Do we need that?
Almost certainly not. There will be something above this in the call
path that is inappropriately accessing all the history.
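
(For anyone who wants to reproduce that kind of measurement, here's a
minimal sketch using stock cProfile - run_command is a stand-in for
whatever operation you want to measure; bzr itself also ships
lsprof-based profiling support.)

    import cProfile
    import pstats

    def run_command():
        # Stand-in for the slow operation, e.g. invoking 'bzr log'
        # through bzrlib's command machinery.
        pass

    # Sorting by call count is what surfaces storms like 16K calls
    # to bzrlib.index.GraphIndex.iter_entries().
    cProfile.run('run_command()', 'bzr.prof')
    pstats.Stats('bzr.prof').sort_stats('calls').print_stats(20)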

We have two basic routes to 'fix' the performance problem.
1) Reintroduce the constraint that knits had - buffer all the index data
upfront, and allow our adapted code to carry on unmodified (sketched
below).
2) Fix the root cause.
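
A hypothetical sketch of route 1 - a band-aid adapter that buffers an
entire index up front so knit-era assumptions stay fast, at the price
of re-imposing the all-of-history constraint (iter_all_entries() here
is a stand-in for 'read the whole index as (key, value) pairs'):

    class BufferingIndexAdapter(object):
        """Band-aid: restore the knit constraint over a partial index."""

        def __init__(self, index):
            # Pays the full size-of-history cost immediately.
            self._cache = dict(index.iter_all_entries())

        def iter_entries(self, keys):
            # Every lookup is now an in-memory hit, as with .kndx.
            for key in keys:
                if key in self._cache:
                    yield key, self._cache[key]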

What about releases? Surely we *have* to buffer everything to be fast
enough for a release. Pragmatically, yes we do. Until the root causes
are fixed, we'll see unacceptable performance hits from removing the
constraints we had. And the reason for this is simple: until *both* the
constraint and the accommodation of it are removed, our performance
won't increase. Once both the constraint and the accommodation *are*
removed we can expect big performance jumps (except where we actually do
have to scale to size of history :)).

An important thing to note, though, is that for users with large trees
or large histories the existing performance *is bad*. We're not solving
their problems *until* we solve the root cause.


Again, what is this all about?

bzr.dev should not have performance band-aids merged to it. Such
band-aids should only go to release branches.

There are two primary reasons for this. One is social - I have a theory
that unless the accommodations made by bzrlib are exposed, developers
will not actually fix them.

As a practical example, consider the raft of performance fixes for
commands I don't run every day that have turned up as more developers
start using packs.

This is best summarised as: 'If it doesn't itch, we don't scratch it.'

The other reason is to do with design quality:

In bzr.dev we have 3 metrics for code:
 * Doesn't reduce test coverage: if it adds new methods or commands,
   there should be tests for them.  There is a good test framework
   and plenty of examples to crib from, but if you are having trouble
   working out how to test something feel free to post a draft patch
   and ask for help.

 * Doesn't reduce design clarity, such as by entangling objects
   we're trying to separate.  This is mostly something the more
   experienced reviewers need to help check.

 * Fixes bugs, or improves features, speed, or code simplicity.

APIs that *embed or extend* the corners of the performance triangle
into our code base can reduce our design quality and need to be
examined extremely closely. If the tradeoff is being made to *preserve
other buggy code* then we're changing the wrong part of the code base.

And in light of this, this is why I rejected a patch for bzr.dev the
other day, because:
 - it converted a bad size-of-project-history API into a
   size-of-all-repository API
 - it embedded the concept of 'buffer in memory' into a layer which had
   no such public constraint
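
To illustrate the shape of API at issue (hypothetical signatures, not
the code from the rejected patch):

    # Embeds 'buffer in memory' and scales with repository size:
    def get_whole_ancestry(repository):
        """Return a dict mapping every revision id in the repository
        to its parents."""

    # Free of that constraint: callers name just the slice of history
    # they need, and the implementation may stream it.
    def iter_ancestry(repository, revision_ids):
        """Yield (revision_id, parents) for the requested revisions
        only."""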

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.