[RFC] pack index lookup performance

Tue Apr 15 10:54:57 BST 2008

Robert Collins wrote:
[...]
> To go back to basics on the design:
> We have P packs in a repository. Each has len(P) records, which are
> arranged in a  roughly log10 distribution. We lookup keys linearly
> across the P packs in an arbitrary order, and without mutating that
> order. (we don't move the found-index to the front of the list or
> anything like that).
> A lookup of a key k (which is present) in a repository then has a
> 1/len(P) probability of being in each of the P packs. If the lookup was
> ordered with smallest packs first, a 'ready-to-repack' (having 9 of the
> largest 10^ sized pack, 9 of the next smaller size etc) repository will
> on average look in 5 packs, if k is chosen randomly. (because 90% of the
> keys are in at most 9 packs).

I don't follow your logic here, maybe there's a typo?  Don't you mean "if the
lookup was ordered with the *largest* packs first"?  Otherwise out of N files,
you're going to look at N-9 of them first and still have seen only 10% of the
keys.

>                               If k is biased to the most recent
> insertions then we will look in all but 5 of the packs on average (as
> most recent insertions are in the smallest packs that consume up to 10%
> of the repository). 

That makes more sense.

[...]
> So using the statistics on pack frequency, on lookups a key, one index
> will see the key it has, but on average all but 5 of the index objects
> will have requests for keys they don't have, and have already done the
> disk IO for. (ordering the packs by descending record count would change

That's an extremely hard to parse sentence.  Is it just restating the earlier
deduction that a single lookup will probably hit all but 5 packs (with an
apparent unstated assumption of smallest-to-largest search order)?

[...]
> Some interesting things to identify before changing things:
>  - the number of miss-lookups during e.g. log() on a per-index basis
>  - the ratio of hit-lookups to miss-lookups on a per-index basis.
>  - the relative timing of miss lookups vs hit lookups. This will be
>    tricky due to the user of generators and do-many-thing methods.
>  - the total lookups made in the index layer.

Also interesting may be the % of the index that has been read in by the end of
the operation, again per index.

And perhaps the number of bisection iterations overall.  Off the top of my head
I'd expect a 'ready-to-repack' repo to have a moderate but not crazy increase in
the total number of bisections vs. a single pack file.  Where a repo with a
single pack file needs to do (log n) bisections to search for a key, a
ready-to-repack one may need to do something closer to (log n × log n)...
although that's still much better than linear.

-Andrew.