Btree versus Graph Index

John Arbash Meinel john at arbash-meinel.com
Fri Nov 7 19:10:19 GMT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I went ahead and mirrored the bzr.dev repository in a btree format.
It isn't 100% the same as bzr.dev, because it has 1 major pack, and then
11 minor ones. But I think it is a reasonable example of comparing the
performance of --1.9 versus --pack-0.92.

For testing, I created a copy of my local repository, and then pulled
the latest bzr.1.9 into it. My latency should be approximately the same
to both machines. At least, I believe bazaar-vcs.org and
people.ubuntu.com are served from the same data center.


$ time bzr branch http://bazaar-vcs.org/bzr/bzr.1.9
Branched 3817 revision(s).

real    1m16.128s
user    0m0.015s
sys     0m0.031s

$ time bzr branch http://people.ubuntu.com/~jameinel/bzr/releases/bzr.1.9
Branched 3817 revision(s).

real    0m31.965s
user    0m0.000s
sys     0m0.030s

And I attached the graph of my network usage between the two formats.

In both cases, the same *content* needed to be transferred. So the
differences are:

1) Repository layout. I would like to correct for this, but it is true
that the btree repo is going to be a little bit tighter packed than
bzr.dev. Though it *does* have 12 pack files, so it isn't a single-pack
repo.

2) btree versus graph-index. This (I believe) is the biggest difference.
I'll revisit my calculations here for why btree is so much better.

For the btree repository, the .rix file (which we used to determine the
revision ids present) is about 740kB, which represents 180 pages. I
don't know for sure, but this should be a 3-level tree (root node => 2
pages => 177 leaf pages). To find a single revision is 3-round-trips for
 12kB downloaded, to find multiple would be (4k + 8k + 64k) 76kB.

In my local repo, the largest .rix file is 1.6MB, which is 20077
revisions, and 25 64kB pages. log2(25) = 4.6, so roughly 5 round trips
to find a revision (320kB) downloaded.

The actual transfer only needed to copy 6 revisions. It is a bit tricky
to tease out, because it seems the branch also triggers an autopack,
which combines the new 6 revs with 116 others. But the copy also effects
8 files.

So just for finding the 6 new revisions, I would expect the btree repo
to copy 76kB, while the 0.92 repo would copy probably 400kB or so
(approx a factor of 5.3:1). As for round-trips, btree is probably 4 or 5
round trips, 0.92 is 6-7. Not a big difference there.

The bigger change would be when copying the file texts.

The largest btree index is 2MB, the largest 0.92 is 5.6MB. So we have
512 btree pages (1, 5, 506 leaves). And 89 64kB GI pages. So it takes
log2(89) = 6.5 aka 7 round trips = 448kB to find a single text entry,
versus again 3*4k = 12k for btree.
Now if you are looking for all keys simultaneously, and they are evenly
spread out, the btree will take (1+5+16)*4k = 88kB, which is still 5:1
versus 0.92.



The other big issue with minimizing index data, is that it is data that
we won't cache. So each time we want to get information from the remote
repository, we end up re-reading the same index data, even if we don't
re-read the .pack data.

Anyway, so far I'm very pleased with how much better btree indexes are.
I'm curious if there is any more that we could push it. But at least it
is a very good start. (Possibly bloom filters, though in our local
testing, they didn't help much.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkUkpsACgkQJdeBCYSNAAO57wCfSysCidzrlWzPFF0YVZix6CgF
w7MAmQGriX410AwZbIkU0ARi4LuwOeYn
=H8/s
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Update_gi_vs_btree.png
Type: image/png
Size: 4909 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20081107/e49f33a7/attachment.png 


More information about the bazaar mailing list