[RFC] pack index lookup performance

Tue Apr 15 08:08:42 BST 2008

So the pack index lookup stuff has been a repeated bugbear. I couldn't
find the thread in my mail client, so this is a new one.

I had a chat with Martin and he suggested instrumenting bzr somewhat to
help pin things down. I think this is a useful and important thing to
do, but I won't be doing it in the immediate future. I hope that the
following notes about the index layout and so on will help someone else
both instrument bzrlib better for this (or know how to use the existing
instrumentation that already exists).

We have the following data points:
 - index lookups are slower than desirable 
 - pack makes high level commands faster 
 - reading the entire index so that key lookups are dict lookups
   is faster than the current code (when a large fraction of history 
   is being read).

The first thing we need to get a real handle on this. Is the in-memory
lookup code buggy-and-slow, or being asked to do more work than is
necessary?

To go back to basics on the design:
We have P packs in a repository. Each has len(P) records, which are
arranged in a  roughly log10 distribution. We lookup keys linearly
across the P packs in an arbitrary order, and without mutating that
order. (we don't move the found-index to the front of the list or
anything like that).
A lookup of a key k (which is present) in a repository then has a
1/len(P) probability of being in each of the P packs. If the lookup was
ordered with smallest packs first, a 'ready-to-repack' (having 9 of the
largest 10^ sized pack, 9 of the next smaller size etc) repository will
on average look in 5 packs, if k is chosen randomly. (because 90% of the
keys are in at most 9 packs). If k is biased to the most recent
insertions then we will look in all but 5 of the packs on average (as
most recent insertions are in the smallest packs that consume up to 10%
of the repository). 
A lookup in a single pack requires a bisection search within the index.
First the header and the middle 4K (64K for remote packs) are read.
Then we continue to read in 4K(or 64K) regions until we have no gap for
the key to be in, or the key is found. The number of disk bisections
required is on average log(len(P)*4K/index_row_length). 
That is, a single bigger pack requires (on average) less IO to find a
key K than K single-item packs (which needs K reads).
An important thing to note is that a miss lookup on disk requires as
much IO as a hit - we have to ensure that the key is not present.

So using the statistics on pack frequency, on lookups a key, one index
will see the key it has, but on average all but 5 of the index objects
will have requests for keys they don't have, and have already done the
disk IO for. (ordering the packs by descending record count would change
this to up to 9 indices seeing misses for 10% of the keys, that will
come after, and within those 9 a linear fraction of the 90% of the
keyspace they hold). So repeated lookups will make many more
miss-lookups than hit-lookups.

Having the packs in arbitrary order makes it hard for me to estimate the
needed lookups, so it may be worth making the order explicit just for
modelling and assessment purposes.

Now the disk IO made by indices is well tested; I'm confident we do not
repeat IO's or access data on disk inappropriately. I may be wrong --
but there are tests to look at for this.

Some interesting things to identify before changing things:
 - the number of miss-lookups during e.g. log() on a per-index basis
 - the ratio of hit-lookups to miss-lookups on a per-index basis.
 - the relative timing of miss lookups vs hit lookups. This will be
   tricky due to the user of generators and do-many-thing methods.
 - the total lookups made in the index layer.

Now, without cutting of avenues of investigation, I strongly suspect
that during a log run, where at least 15K lookups are made, the vast
number of miss lookups are multiplying any inefficiency in the
miss-lookup case, both when disk IO is needed, and when it has been
performed and is now cached. Confirming the hit rate would help confirm
this hypothesis.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080415/63924f3e/attachment.pgp