New toy for bzr developers: B+Tree index sketch
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 1 20:29:14 BST 2008
John Arbash Meinel wrote:
| Robert Collins wrote:
| | On Tue, 2008-07-01 at 13:17 +1000, Robert Collins wrote:
| |> I have been tinkering with a B+Tree approach to the disk indices. I
| |> prefer this over a hash based index because of the ability to do
| |> in-order traversal when desired.
I also like that once you've read a btree page, you have more
information. Maybe I'm wrong, but it feels like even after reading only
a little, you already have some useful information. With a hash, you
don't have much useful information until you've been reading leaf nodes
for a while.
I'm not sure what sort order we would actually be using for which that
"in-order" traversal is specifically useful. I suppose for a btree the
keys have to be lexicographically sorted? It would be nice if they
could be topologically sorted.
| |>
| |> In short, I think we can save substantial size with the currently
| |> developed B+Tree and it allows addressing some of the memory problems
| |> Andrew has spent some time debugging.
| |
| | I have turned this into a repository implementation - currently only
| | handling plain repositories - but its easy enough to extend to rich
| | roots and subtrees if there is interest in looking at this.
| |
| | lp:~lifeless/+junk/bzr-index2
| |
| | Known defects:
| | - iter_entries_prefix is not yet implemented (but it's not needed in
| | bzr.dev either)
| | - multi-key searches do not use the index as well as they could to
| | reduce IO. In particular we just bisect each time.
| |
| | I'd love to see Ian's usertest run against --btree-plain, plus any
| | other benchmarks that are cheap in human time. For instance, bzr's
| | repository size has been compared against git's; seeing what this
| | does to help with such comparisons might be nice, on a bzr-svn
| | converted repository (though this needs a rich-root format - see
| | above). I'll toss one of those in tomorrow, probably.
| |
I did a couple in my earlier post.
...
| I'm curious why you settled on 4k pages. I suppose when I was tuning
| Dirstate bisecting, 4k pages were the sweet spot. It seems like larger
| pages get you a bigger win for compression, though.
As I posted in my PS, larger pages don't help much if we are "packing
harder" to start with: only about 3% over 4k pages.
...
| with Robert's version gives 4.7MB, with my version gives 3.3MB. (about
| 1.4:1, or 30-40% savings.)
|
| I think the big win is I don't ever try to "guess" if there is going to
| be wasted space. I just assume it is going to work.
|
Just to see whether the savings came from packer efficiency or from
space wasted because we *guessed* something wasn't going to fit, I wrote
a quick script to count how many \x00 bytes occur in each file (since
that is our padding character). I realize that some nulls are
delimiters, others are padding, etc., but I thought it might be
relevant.
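A minimal sketch of what the counting script does (the real one may
differ slightly; it takes the indices directory on the command line):

import os
import sys

def count_nulls(path):
    # \x00 is both our padding byte and a delimiter, so this
    # over-counts padding slightly.
    with open(path, 'rb') as f:
        return f.read().count(b'\x00')

total = 0
for name in sorted(os.listdir(sys.argv[1])):
    n = count_nulls(os.path.join(sys.argv[1], name))
    total += n
    print('%8d %s' % (n, name))
print('%8d' % total)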
Robert's original btree code (test_repo/.bzr/repository/indices/):

  97369  .iix
  78845  .rix
  22172  .six
 331065  .tix
 529451  total

My code, using 4k page sizes (test_repo2/.bzr/repository/indices/):

   7549  .iix
   7407  .rix
   8126  .six
  21521  .tix
  44603  total

My code, using 8k page sizes (test_repo3/.bzr/repository/indices/):

  16346  .iix
  11703  .rix
  15823  .six
  16802  .tix
  60674  total
Going from 500k nulls to 45k nulls seems like a pretty big win. However,
it doesn't account for all of the storage savings.
I checked the headers of the different indices, and I think we also win
a certain amount simply by having fewer internal nodes. Specifically:

For the original code:
  iix: row_lengths=1,2,249
  tix: row_lengths=1,7,643

For my extra-packing code:
  iix: row_lengths=1,179
  tix: row_lengths=1,4,436

For the 8k pages:
  iix: row_lengths=1,89
  tix: row_lengths=1,209
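As a back-of-the-envelope check (the fan-out of ~100 keys per internal
page is an assumption here, not a measured number), the row shapes
above fall out of a few lines of Python:

def row_lengths(num_leaves, fan_out):
    # Walk up from the leaf row: each internal node references up to
    # fan_out children, until a single root remains.
    rows = [num_leaves]
    while rows[0] > 1:
        rows.insert(0, -(-rows[0] // fan_out))  # ceiling division
    return rows

print(row_lengths(643, 100))  # [1, 7, 643], the shape of the old tix
print(row_lengths(436, 100))  # [1, 5, 436], close to the new 1,4,436

Fewer leaves means geometrically fewer internal pages on top of them.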
So I think the storage savings break down into 3 pieces:

1) We never skip adding a node because we think we *might* overflow. We
just go ahead, overflow, and try again (see the sketch after this
list). This probably accounts for most of the 400k nulls removed.

2) When we repack, we compress without Z_SYNC_FLUSH, which earlier
tests showed gives approximately 3:1 compression instead of 2:1.

3) By packing more onto each page, we don't need as many "bookkeeping"
pages. This is more of a knock-on effect, but it means that packing 10%
more onto each page probably gives more than a 10% decrease in total
size. This might also favor bigger pages for larger projects.
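A sketch of how points 1 and 2 fit together (under assumed names, not
the actual packer code, and O(n^2) as written; the real writer streams
with Z_SYNC_FLUSH and only repacks on overflow):

import zlib

_PAGE_SIZE = 4096  # see the attached patch

def _packed_size(lines):
    # One-shot compression, no Z_SYNC_FLUSH between lines: this is
    # the ~3:1 repack from point 2, versus ~2:1 when sync-flushing.
    c = zlib.compressobj()
    return len(c.compress(b''.join(lines))) + len(c.flush())

def pack_page(lines):
    # Point 1: keep adding lines and back out only on an *actual*
    # overflow, instead of guessing in advance that a line might
    # not fit. Always keep at least one line to guarantee progress.
    accepted = []
    for line in lines:
        accepted.append(line)
        if _packed_size(accepted) > _PAGE_SIZE and len(accepted) > 1:
            accepted.pop()  # overflow: this line starts the next page
            break
    data = zlib.compress(b''.join(accepted))
    # Pad to the fixed page size with \x00, the padding counted above.
    return data.ljust(_PAGE_SIZE, b'\x00'), lines[len(accepted):]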
Attached is a patch which moves the page size into a constant, to make
it easier to modify if we want to test for optimal page sizes on large
projects. Specifically, you had one place where you computed 4096 - 100
= 3996 and then hard-coded the result, which caused weird problems when
I replaced all occurrences of 4096 :).
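The idea of the patch is simply this (names here are illustrative, not
necessarily the ones in the patch):

_PAGE_SIZE = 4096        # one constant to change when experimenting
_RESERVED_HEADER_BYTES = 100

def _usable_bytes():
    # Derive the usable space instead of hard-coding 4096 - 100 = 3996,
    # so changing _PAGE_SIZE can't leave a stale value behind.
    return _PAGE_SIZE - _RESERVED_HEADER_BYTES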
John
=:->
Attachment: variable_pages.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20080701/61e0849d/attachment.diff