Thoughts on index building

Robert Collins robert.collins at canonical.com
Tue Apr 21 05:06:50 BST 2009


John has found that index building is a hot spot for dev6. This is
because we have a 1-2 order-of-magnitude increase in the number of keys
in the index.

Now, there are several things happening at the same time: these new keys
are much more evenly distributed over their keyspace, they are of fixed
length, and there are many more of them.

One of the outstanding 'may do' things is a new index. I wanted to put
some analysis down around the issues indices will have - no point making
a new index style that still suffers the same issues we have today.

At 100K keys we spill to disk - this involves serialising the current
index in a slightly cheaper form, and then treating it as a fallback
index for reads, but otherwise ignoring it until the index is finalised
or we hit a power of two and recombine it again.

We're seeing 400K+ keys, so this will:
 * Write one stripe of 100K
 * Write a stripe of 200K (coalescing the first stripe with memory)
 * Write another stripe of 100K
 * Then one of 400K (coalescing the first two stripes with memory)
 * And finally, at 440K or wherever we precisely stop, read the whole
   index back in to stream out the final version.

So to write 440K keys we have written 1240K keys, and read a minimum of
700K if no lookups were done.
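
To make that concrete, here is a rough back-of-the-envelope model of
the spill policy in Python. It is purely illustrative - simulate_spill
and its coalescing rule are a simplification for this mail, not the
bzrlib code:

SPILL_SIZE = 100 * 1000   # keys held in memory before spilling to disk


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def simulate_spill(total_keys, spill_size=SPILL_SIZE):
    """Model write/read volume (in keys) when building an index.

    Every key serialised to a backing stripe or to the final index
    counts as 'written'; every key parsed back out of a stripe during
    coalescing or finalisation counts as 'read'.
    """
    stripes = []            # sizes of the on-disk backing stripes
    written = read = added = 0
    while added + spill_size <= total_keys:
        added += spill_size
        if is_power_of_two(added // spill_size) and stripes:
            # Power of two: read all existing stripes and recombine
            # them (plus the in-memory keys) into one larger stripe.
            read += sum(stripes)
            stripes = [added]
        else:
            # Plain spill of the in-memory keys as a new 100K stripe.
            stripes.append(spill_size)
        written += stripes[-1]
    # Finalise: read everything back and stream out the finished index.
    read += sum(stripes)
    written += total_keys
    return written, read


written, read = simulate_spill(440 * 1000)
print('written %dK keys, read %dK keys' % (written // 1000, read // 1000))
# -> written 1240K keys, read 800K keys with this simplified policy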

Why do we do this? Scalability and integrity.

On the integrity side we have a constraint that any one index can only
record one value for a given key. To avoid triggering this we often
query for keys before adding them. This may be less important than we
thought it was, as long as the value of all the referenced items is the
same. OTOH avoiding duplicates where we do care requires the same
facility, so making querying of in-progress indices cheap is worth
doing, I think.

On the scalability side we spill the in-memory index to disk to avoid
chewing up too much memory, particularly with very large key counts
(I've been testing over the weekend with 6M-key indices). Even without
considering dictionary packing overheads or the value data, 6M sha1
keys (20 bytes each) is going to be 120MB.

Even if we build a sha1 specific index we will probably want/need to
limit memory use in very large history/tree environments.

Now, consider what happens as we insert sha1s into an index that is
spilling to disk: because there is no locality of reference, probing
for prior node existence is going to thrash the backing index cache.
Specifically, if the cache can hold 10% of the index, each new key we
check for has only a 10% chance of being in the cache, because hashes
jump around a lot :). Now, the internal nodes are likely all cached even in
very large indices; and even when they aren't they are still likely to
be getting *some* cache hits.

So as the index size grows, we converge on doing real backing-index
reads for every key lookup that is done. We need to hit the backing
indices in the common case because we're looking up keys that don't
exist, so we can't short-circuit.

Backing index upper limit = log2(keys inserted / 100K)
 -> checking 1 key scales with log(N)
 -> checking N keys is (roughly) N log N read-and-parse operations.
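
Or, as a quick Python sketch of that cost model (the helpers here are
made up for illustration, they are not bzrlib code):

import math

SPILL_SIZE = 100 * 1000   # keys per in-memory spill, as above


def backing_index_limit(keys_inserted, spill_size=SPILL_SIZE):
    """The upper limit above: log2(keys inserted / 100K) backing indices."""
    if keys_inserted < spill_size:
        return 0
    return max(1, math.log(keys_inserted / float(spill_size), 2))


def probe_cost(total_keys, spill_size=SPILL_SIZE):
    """Roughly N log N: each key is probed against every backing index
    that exists when it is inserted (worst case - every probe misses
    the cache, which is what randomly distributed keys give us)."""
    return sum(backing_index_limit(i, spill_size)
               for i in range(1, total_keys + 1))


for n in (440 * 1000, 1490 * 1000):
    print('%dK keys -> roughly %dK read-and-parse operations'
          % (n // 1000, int(probe_cost(n)) // 1000))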

An lsprof of inserting 1.49M keys into a btree index gives some
interesting info. The test data is 2-tuples with the second key element
often duplicated - e.g. (A, B), (C, B), (D, B). The insertion order
scans across the first key element space, and across the second
regularly - so it looks somewhat similar in terms of locality of
reference to CHK insertions.
 * 95% of the time is 1.54M queries (this test data has some duplicates)
 * 3% of the time is 1.49M insertions
 * 1.38% is the final read-and-serialise to the finished index.

What can we do to fix this? 
 * Keep enough data in ram to avoid disk IO
 * Make index checks much cheaper
 * Defer duplicate checking to the finalise step
 * ...

For the first option, there are some possibilities - more may exist :).
We can look at using a bloom per backing index, or something along those
lines, which allows us to avoid 98% or so of the disk IO for missing
keys. John and I have an experiment going on for that at the moment (but
I'll be leaving it up to John to finish it - the network awaits). We
could also look at keeping some prefix of the key (or a prefix of a hash
of the key for non-CHK keys) in memory, which may scale reasonably well.
To deal with 2M keys distributed homogeneously we only need about 33
bits for reasonable selectivity. A bloom in theory only needs 8 bits per
key though, so it's very attractive.
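
For a feel of the bloom idea, here is a minimal hand-rolled sketch -
not the experimental code John and I have going, and the class name and
parameters are just illustrative:

import hashlib
import struct


class KeyBloom(object):
    """Minimal bloom filter sketch: one per backing index, checked
    before doing any disk IO."""

    def __init__(self, expected_keys, bits_per_key=8, num_hashes=4):
        self._num_bits = max(64, expected_keys * bits_per_key)
        self._num_hashes = num_hashes
        self._bits = bytearray(self._num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from a single sha1 of the
        # (tuple) key.
        digest = hashlib.sha1('\x00'.join(key).encode('utf-8')).digest()
        for i in range(self._num_hashes):
            (value,) = struct.unpack_from('>I', digest, i * 4)
            yield value % self._num_bits

    def add(self, key):
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        # False means the key is definitely not in the backing index,
        # so the read-and-parse can be skipped entirely.
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

With 8 bits per key and 4 hash functions the false positive rate works
out to a couple of percent, which lines up with the 98%-or-so figure
above; for CHK keys we could also slice the bit positions straight out
of the existing sha1 rather than hashing again.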

We probably can't make index checks a *lot* cheaper without changing the
format (but changing the format is possible - we're working towards a
new format after all). We might be able to do something in C with a
focus on not parsing the full node.

Deferring duplicate checking to the finalisation would cause late errors
- rather than 'X is present already' when inserting X, we would instead
get 'X is duplicated' when finalising. OTOH it completely avoids
thrashing the backing index cache while creating it.
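
A sketch of what that finalise-time check could look like (a
hypothetical helper, not existing bzr code), assuming finalisation
iterates items in sorted key order - which a B+Tree writer does anyway -
so any duplicates are adjacent:

def iter_checking_duplicates(sorted_items):
    """Yield (key, value, references) items, raising on a duplicated key.

    Because the finished index is written in sorted key order,
    duplicates are always adjacent, so no per-insertion lookups against
    the backing indices are needed at all.
    """
    previous_key = None
    for key, value, references in sorted_items:
        if key == previous_key:
            raise ValueError('key %r is duplicated' % (key,))
        previous_key = key
        yield key, value, references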

A new index that is tailored to CHK keys won't avoid the scaling issues
with values - we probably still need spill-to-disk while building the
index. It may be a good idea regardless simply to get more compact
indices.

That said, I think we should be able to get solid performance out of the
B+Tree index we have today; it has no particular reason to be bad as
long as we don't do things that are intrinsically bad to it :)

-Rob