[RFC] pack index lookup performance
Robert Collins
robertc at robertcollins.net
Tue Apr 15 21:48:28 BST 2008
On Tue, 2008-04-15 at 19:54 +1000, Andrew Bennetts wrote:
> Robert Collins wrote:
> [...]
> > To go back to basics on the design:
> > We have P packs in a repository. Each has len(P) records, which are
> > arranged in a roughly log10 distribution. We lookup keys linearly
> > across the P packs in an arbitrary order, and without mutating that
> > order. (we don't move the found-index to the front of the list or
> > anything like that).
> > A lookup of a key k (which is present) in a repository then has a
> > 1/len(P) probability of being in each of the P packs. If the lookup was
> > ordered with smallest packs first, a 'ready-to-repack' (having 9 of the
> > largest 10^ sized pack, 9 of the next smaller size etc) repository will
> > on average look in 5 packs, if k is chosen randomly. (because 90% of the
> > keys are in at most 9 packs).
>
> I don't follow your logic here, maybe there's a typo? Don't you mean "if the
> lookup was ordered with the *largest* packs first"? Otherwise out of N files,
> you're going to look at N-9 of them first and still have seen only 10% of the
> keys.
yes, typo.
> [...]
> > So using the statistics on pack frequency, on lookups a key, one index
> > will see the key it has, but on average all but 5 of the index objects
> > will have requests for keys they don't have, and have already done the
> > disk IO for. (ordering the packs by descending record count would change
>
> That's an extremely hard to parse sentence. Is it just restating the earlier
> deduction that a single lookup will probably hit all but 5 packs (with an
> apparent unstated assumption of smallest-to-largest search order)?
I think so :P I should have reread the whole mail before sending.
> [...]
> > Some interesting things to identify before changing things:
> > - the number of miss-lookups during e.g. log() on a per-index basis
> > - the ratio of hit-lookups to miss-lookups on a per-index basis.
> > - the relative timing of miss lookups vs hit lookups. This will be
> > tricky due to the user of generators and do-many-thing methods.
> > - the total lookups made in the index layer.
>
> Also interesting may be the % of the index that has been read in by the end of
> the operation, again per index.
Indeed. I would expect this to approach unity for log() and other
most-of-history operations.
> And perhaps the number of bisection iterations overall. Off the top of my head
> I'd expect a 'ready-to-repack' repo to have a moderate but not crazy increase in
> the total number of bisections vs. a single pack file. Where a repo with a
> single pack file needs to do (log n) bisections to search for a key, a
> ready-to-repack one may need to do something closer to (log n × log n)...
> although that's still much better than linear.
Indeed; this is where the hit/miss ratio is useful because hits get
cached per index, but there is no miss cache at the moment.
So if you ask a 1-record index that has been fully parsed for 15K keys
it does not have, it does 15K bisections.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080416/48285e73/attachment.pgp
More information about the bazaar
mailing list