New toy for bzr developers: B+Tree index sketch

Robert Collins robertc at
Wed Jul 2 00:27:34 BST 2008

On Tue, 2008-07-01 at 17:23 -0500, John Arbash Meinel wrote:

> | Thanks for spending time tuning the size. I don't think 4K is
> | necessarily best; Indeed, 64K would clearly be better in terms of
> | accessing remote servers *if* we get a high hit rate. (Reading 128K to
> | access one key/determine one key is missing is bad compared to 8K; we
> | want lots of accesses to the same node to accommodate the overhead of
> | the read).
> I actually think doing 4k pages would be rather good for local access,
> and over remote we could just read multiple pages at a time.
> Also, depending on how you store them, we may or may not actually have
> locality. If they are topologically sorted, we have a lot of locality,
> lexicographically, I'm not really sure. We might have some locality, as
> all of the robertc at revisions are close together, but
> all of the merges with pqm@ will be far away.

So with a btree, to do graph locality we'd need some way to traverse the
tree that is graph-local (rather than key-local).

E.g. if we had node ids allocated serially within an index, and a
key->nodeid map that was very fast, we could do history walking within a
single index very fast, by starting with the key, doing one map to a
nodeid, and then after that nodes would be close to each other.

> | there is a .copy() on compressobj - I checked the python code :). But it
> | depends if the zlib present when python is built has deflateCopy, so I
> | felt it was likely to give use installation issues.
> It isn't exposed at the Python level, so we would need some sort of
> hackery to get it exposed.

Are you sure? It looked like it was exposed to me.

> |
> |> ... Note that deflateCopy duplicates the internal compression state
> |> which can be quite large, so this strategy is slow and can consume lots
> |> of memory.
> |
> | OTOH zlib is used for very large data sets, and our typical indices are
> | not; individual pages are definitely not.
> Yeah, and the definition of "slow and consume lots of memory" can be a
> very different time scale. And I'm *sure* it is faster than starting
> from scratch and compressing everything to this point. :)
> |
> |> I also tried switching to 8192-sized pack files. And with my code, it
> |> drops the size to 3.2MB. So about a 3% savings. Certainly not nearly as
> |> much as just trying to pack more into the existing 4096 bytes.
> |
> | -Rob
> As for putting this in the header... I actually was thinking the same
> thing. I don't really know how you could ever read a random page in the
> index, without first reading the header anyway, as you need the
> information stored there, don't you? (Number of each level of node, etc.)

Yes. Another thing though about 8K pages is that its 8K to decompress
and parse to access some key.

> I was thinking that "bzr pack" could possibly tune this sort of thing.
> We might try really big node sizes and see if there is any gain. A
> tunable parameter that only tunes <10% isn't worth it. Only if we found
> that going up to 64K pages suddenly gives us 50% the size would I really
> worry about it.

Yes, pack is ideal for this.

> We might also consider using bz2. I didn't find any specific reason
> (compressing *all* revision ids in bzr was 3.5:1 vs 3.1:1. Only a 13%
> gain, at the cost of bz2 overhead.) LZMA comes to mind, but my
> understanding is it is mostly gzip/zlib with a much larger window. So it
> probably won't give a specific benefit for our pages.
> By the way, if you are interested in having me do it, I would be
> interested in tuning the 'iter_entries()' lookup, to allow it to search
> for multiple keys at a time, rather than doing the bisect per key.

Please. see the TODO there? :).

> Also, the depth of the tree is known ahead of time, right? So for all
> keys we have the same number of steps in the bisect. Which makes the
> loop nice and easy to write.


> I also just had an idea. Right now for the btree, we have N+1 possible
> children. Specifically you have:
> ~       A           B
> ~   /        |             \
> before A  between A&B    after B

No. We have:
Root:        B
          /      \
    A              C
  /   \        /      \
<A    A-<B   B-<C     C-EOF 

> I wonder if we could have a win doing the btree as:
> ~      A        B        C
> ~      |        |        |
> ~      A...  ...B...  ...C
> My idea is that if your key is <= the shortest key, you can stop,
> because you know it isn't present. Same thing for key >= largest key.
> So you lose a +1 fan out, but you gain the ability to cull keys before
> you reach the bottom.

I considered this - but it isn't worth doing IMO. Done just at the edge
of an internal node, you lose one fan out, and save IO only at the very
edge of the entire key space the internal node covers. IF you want more
coverage, you need twice as many keys in the internal node - 1 for the
start, and 1 for the stop, for every child page.

If you think about it, say we get 100 keys per node. So given two nodes
N1 and N2, the only keys that you reduce miss-IO on are those between
END(N1) and START(N2). If you expect a homogeneous distribution of keys,
then if there are 100K keys in a repo(which we could ask for), and a
specific index has 20K keys, 80% of the keys are absent, but spread
equally. So between each key we have there are 4 we don't.
Missing-keys-in-nodes = 99*4
Missing-keys-between-nodes = 4
or a 1% reduction in traversals past that point, for a reduction in fan
out. But halving the density of internal nodes makes the tree higher,
and dropping(in this example) from 101 fan-out to 50 fan-out changes us
so, at 10K nodes we need a new root node giving:

with a 100 page cache, thats going to hurt.

I think a much better answer to miss lookups is a bloom filter per
internal node.

> As one of the problems (we believe) with our current indexes is the
> overhead for *misses*, this might be a big win.
> This isn't something that would be compatible with your current code, as
> it has to change the reader and writer. And it probably means we don't
> have a 'pure' b+tree.
> However, I get the feeling that b-trees were structured around *hits*
> and not misses.
> Also, note that we really need to be testing against a real-world
> repository like I had. With 18+ pack files. (I had 22648 nodes, so 22
> was our 'optimal' point.)

Yeah, I'm testing against fully packed repositories at this point in the
analysis. We can synthesis x% of miss keys spread over the keyspace and
query for them of course - or just write a converter that changes a
repository to btree without fetching.

> I feel like our current index does decently when there is 1 pack file,
> and starts to tail off dramatically after that.
> I also wonder about combining searches across pack files, but because of
> how things are sorted, I'm not sure that is a win. (hash based is a bit
> more obvious how to combine the top-level objects, not sure on a b+tree
> how you would combine the first pages.)

Basically you need to traverse the tree :)

GPG key available at: <>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : 

More information about the bazaar mailing list