Btree versus Graph Index
John Arbash Meinel
john at arbash-meinel.com
Fri Nov 7 19:10:19 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
So I went ahead and mirrored the bzr.dev repository in a btree format.
It isn't 100% the same as bzr.dev, because it has 1 major pack, and then
11 minor ones. But I think it is a reasonable example of comparing the
performance of --1.9 versus --pack-0.92.
For testing, I created a copy of my local repository, and then pulled
the latest bzr.1.9 into it. My latency should be approximately the same
to both machines. At least, I believe bazaar-vcs.org and
people.ubuntu.com are served from the same data center.
$ time bzr branch http://bazaar-vcs.org/bzr/bzr.1.9
Branched 3817 revision(s).
real 1m16.128s
user 0m0.015s
sys 0m0.031s
$ time bzr branch http://people.ubuntu.com/~jameinel/bzr/releases/bzr.1.9
Branched 3817 revision(s).
real 0m31.965s
user 0m0.000s
sys 0m0.030s
And I attached the graph of my network usage between the two formats.
In both cases, the same *content* needed to be transferred. So the
differences are:
1) Repository layout. I would like to correct for this, but it is true
that the btree repo is going to be a little bit tighter packed than
bzr.dev. Though it *does* have 12 pack files, so it isn't a single-pack
repo.
2) btree versus graph-index. This (I believe) is the biggest difference.
I'll revisit my calculations here for why btree is so much better.
For the btree repository, the .rix file (which we used to determine the
revision ids present) is about 740kB, which represents 180 pages. I
don't know for sure, but this should be a 3-level tree (root node => 2
pages => 177 leaf pages). To find a single revision is 3-round-trips for
12kB downloaded, to find multiple would be (4k + 8k + 64k) 76kB.
In my local repo, the largest .rix file is 1.6MB, which is 20077
revisions, and 25 64kB pages. log2(25) = 4.6, so roughly 5 round trips
to find a revision (320kB) downloaded.
The actual transfer only needed to copy 6 revisions. It is a bit tricky
to tease out, because it seems the branch also triggers an autopack,
which combines the new 6 revs with 116 others. But the copy also effects
8 files.
So just for finding the 6 new revisions, I would expect the btree repo
to copy 76kB, while the 0.92 repo would copy probably 400kB or so
(approx a factor of 5.3:1). As for round-trips, btree is probably 4 or 5
round trips, 0.92 is 6-7. Not a big difference there.
The bigger change would be when copying the file texts.
The largest btree index is 2MB, the largest 0.92 is 5.6MB. So we have
512 btree pages (1, 5, 506 leaves). And 89 64kB GI pages. So it takes
log2(89) = 6.5 aka 7 round trips = 448kB to find a single text entry,
versus again 3*4k = 12k for btree.
Now if you are looking for all keys simultaneously, and they are evenly
spread out, the btree will take (1+5+16)*4k = 88kB, which is still 5:1
versus 0.92.
The other big issue with minimizing index data, is that it is data that
we won't cache. So each time we want to get information from the remote
repository, we end up re-reading the same index data, even if we don't
re-read the .pack data.
Anyway, so far I'm very pleased with how much better btree indexes are.
I'm curious if there is any more that we could push it. But at least it
is a very good start. (Possibly bloom filters, though in our local
testing, they didn't help much.)
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkkUkpsACgkQJdeBCYSNAAO57wCfSysCidzrlWzPFF0YVZix6CgF
w7MAmQGriX410AwZbIkU0ARi4LuwOeYn
=H8/s
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Update_gi_vs_btree.png
Type: image/png
Size: 4909 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20081107/e49f33a7/attachment.png
More information about the bazaar
mailing list