[RFC] pack index lookup performance

Tue Apr 15 21:48:28 BST 2008

On Tue, 2008-04-15 at 19:54 +1000, Andrew Bennetts wrote:
> Robert Collins wrote:
> [...]
> > To go back to basics on the design:
> > We have P packs in a repository. Each has len(P) records, which are
> > arranged in a  roughly log10 distribution. We lookup keys linearly
> > across the P packs in an arbitrary order, and without mutating that
> > order. (we don't move the found-index to the front of the list or
> > anything like that).
> > A lookup of a key k (which is present) in a repository then has a
> > 1/len(P) probability of being in each of the P packs. If the lookup was
> > ordered with smallest packs first, a 'ready-to-repack' (having 9 of the
> > largest 10^ sized pack, 9 of the next smaller size etc) repository will
> > on average look in 5 packs, if k is chosen randomly. (because 90% of the
> > keys are in at most 9 packs).
> 
> I don't follow your logic here, maybe there's a typo?  Don't you mean "if the
> lookup was ordered with the *largest* packs first"?  Otherwise out of N files,
> you're going to look at N-9 of them first and still have seen only 10% of the
> keys.

yes, typo.

> [...]
> > So using the statistics on pack frequency, on lookups a key, one index
> > will see the key it has, but on average all but 5 of the index objects
> > will have requests for keys they don't have, and have already done the
> > disk IO for. (ordering the packs by descending record count would change
> 
> That's an extremely hard to parse sentence.  Is it just restating the earlier
> deduction that a single lookup will probably hit all but 5 packs (with an
> apparent unstated assumption of smallest-to-largest search order)?

I think so :P I should have reread the whole mail before sending.

> [...]
> > Some interesting things to identify before changing things:
> >  - the number of miss-lookups during e.g. log() on a per-index basis
> >  - the ratio of hit-lookups to miss-lookups on a per-index basis.
> >  - the relative timing of miss lookups vs hit lookups. This will be
> >    tricky due to the user of generators and do-many-thing methods.
> >  - the total lookups made in the index layer.
> 
> Also interesting may be the % of the index that has been read in by the end of
> the operation, again per index.

Indeed. I would expect this to approach unity for log() and other
most-of-history operations.

> And perhaps the number of bisection iterations overall.  Off the top of my head
> I'd expect a 'ready-to-repack' repo to have a moderate but not crazy increase in
> the total number of bisections vs. a single pack file.  Where a repo with a
> single pack file needs to do (log n) bisections to search for a key, a
> ready-to-repack one may need to do something closer to (log n × log n)...
> although that's still much better than linear.

Indeed; this is where the hit/miss ratio is useful because hits get
cached per index, but there is no miss cache at the moment.

So if you ask a 1-record index that has been fully parsed for 15K keys
it does not have, it does 15K bisections.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080416/48285e73/attachment.pgp