New toy for bzr developers: B+Tree index sketch

Tue Jul 1 23:23:56 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
| On Tue, 2008-07-01 at 14:07 -0500, John Arbash Meinel wrote:
|
|> | Not tested yet:
|> |  - high latency (NFS or Remote or http) links
|>
|> Did you have simple scripts to make it easy to reproduce your results?
|
| yeah, I sent that at midnight; brain not worky so well. I'll add a
| couple of extra things and make it more parameterised and attach it
| later today.
|
| Thanks for spending time tuning the size. I don't think 4K is
| necessarily best; Indeed, 64K would clearly be better in terms of
| accessing remote servers *if* we get a high hit rate. (Reading 128K to
| access one key/determine one key is missing is bad compared to 8K; we
| want lots of accesses to the same node to accommodate the overhead of
| the read).

I actually think doing 4k pages would be rather good for local access,
and over remote we could just read multiple pages at a time.

Also, depending on how you store them, we may or may not actually have
locality. If they are topologically sorted, we have a lot of locality,
lexicographically, I'm not really sure. We might have some locality, as
all of the robertc at robertcollins.net revisions are close together, but
all of the merges with pqm@ will be far away.

|
|
|> John
|> =:->
|>
|> PS> It would all be so much simpler if we could just snapshot the state
|> of the compressor. Do the flush, restore the state, etc. zlib itself
|> supports it with deflateCopy(). Though it does warn:
|
| there is a .copy() on compressobj - I checked the python code :). But it
| depends if the zlib present when python is built has deflateCopy, so I
| felt it was likely to give use installation issues.

It isn't exposed at the Python level, so we would need some sort of
hackery to get it exposed.

|
|> ... Note that deflateCopy duplicates the internal compression state
|> which can be quite large, so this strategy is slow and can consume lots
|> of memory.
|
| OTOH zlib is used for very large data sets, and our typical indices are
| not; individual pages are definitely not.

Yeah, and the definition of "slow and consume lots of memory" can be a
very different time scale. And I'm *sure* it is faster than starting
from scratch and compressing everything to this point. :)

|
|> I also tried switching to 8192-sized pack files. And with my code, it
|> drops the size to 3.2MB. So about a 3% savings. Certainly not nearly as
|> much as just trying to pack more into the existing 4096 bytes.
|
| -Rob

As for putting this in the header... I actually was thinking the same
thing. I don't really know how you could ever read a random page in the
index, without first reading the header anyway, as you need the
information stored there, don't you? (Number of each level of node, etc.)

I was thinking that "bzr pack" could possibly tune this sort of thing.
We might try really big node sizes and see if there is any gain. A
tunable parameter that only tunes <10% isn't worth it. Only if we found
that going up to 64K pages suddenly gives us 50% the size would I really
worry about it.

We might also consider using bz2. I didn't find any specific reason
(compressing *all* revision ids in bzr was 3.5:1 vs 3.1:1. Only a 13%
gain, at the cost of bz2 overhead.) LZMA comes to mind, but my
understanding is it is mostly gzip/zlib with a much larger window. So it
probably won't give a specific benefit for our pages.

By the way, if you are interested in having me do it, I would be
interested in tuning the 'iter_entries()' lookup, to allow it to search
for multiple keys at a time, rather than doing the bisect per key.

Also, the depth of the tree is known ahead of time, right? So for all
keys we have the same number of steps in the bisect. Which makes the
loop nice and easy to write.

I also just had an idea. Right now for the btree, we have N+1 possible
children. Specifically you have:
~       A           B
~   /        |             \
before A  between A&B    after B

I wonder if we could have a win doing the btree as:

~      A        B        C
~      |        |        |
~      A...  ...B...  ...C

My idea is that if your key is <= the shortest key, you can stop,
because you know it isn't present. Same thing for key >= largest key.
So you lose a +1 fan out, but you gain the ability to cull keys before
you reach the bottom.

As one of the problems (we believe) with our current indexes is the
overhead for *misses*, this might be a big win.

This isn't something that would be compatible with your current code, as
it has to change the reader and writer. And it probably means we don't
have a 'pure' b+tree.

However, I get the feeling that b-trees were structured around *hits*
and not misses.

Also, note that we really need to be testing against a real-world
repository like I had. With 18+ pack files. (I had 22648 nodes, so 22
was our 'optimal' point.)

I feel like our current index does decently when there is 1 pack file,
and starts to tail off dramatically after that.

I also wonder about combining searches across pack files, but because of
how things are sorted, I'm not sure that is a win. (hash based is a bit
more obvious how to combine the top-level objects, not sure on a b+tree
how you would combine the first pages.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhqrnwACgkQJdeBCYSNAAMUewCeNzT6bgcvdPxHXvvXgvZmYUi7
H14An22n+ZEabxq6S+I2dudOb968t3oA
=d11v
-----END PGP SIGNATURE-----