[MERGE] Updates to the "auto-buffer" logic of GraphIndex

Tue Sep 2 19:12:22 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> bb:tweak
> 
> I suggest testing with a couple of figures. E.g. 25% and 75%. or 10% and
> 30%.
> 
> Then go with whatever works best.
> 
> We're probably throwing out annotate and other operations by doing this
> - it's the sort of thing that can be sensitive :(.
> 
> -Rob

"bzr annotate bzr+ssh://" is still broken, because of the Repo locking bug
(which, Andrew, I'm happy to bring in after feature freeze, *poke* :)

For "bzr annotate local/bzrlib/builtins.py" I see the 50% auto-buffer being a
little bit faster, but within the noise margin (14.5s versus 13.6s with
--show-ids, but I also saw 15.4s versus 15.7s with --no-show-ids).

For "bzr annotate http://localnetwork" they are again close to each other.
For "bzr annotate http://bazaar-vcs.org/..." they go back and forth being
faster, so I think network issues dominate there.

bzrlib/builtins.py may not be the best test case, as it requires accessing a
*lot* of data.

This also generally holds true w/ or w/o --show-ids.

It turns out that for "bzrlib/index.py" it does trigger extra data requests.

Specifically, the code that says "if you request the whole thing, _buffer_all"
is a definite win. (part 1).

The code that says "if you are calling iter_entries and you've read 50% of the
file, _buffer_all" has a specific failure: the data for the iter_entries call
may already be available. So I changed the code so that it instead says "if
I'm about to call readv(), and have already read >50% of the file, _buffer_all
instead". It uses the same location as (part 1), which helps a bit.

So now I defer the extra GET until we've determined that the 50% we've read so
far isn't enough, rather than assuming once we hit 50% any future work will
need data access.
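
To make the deferred check concrete, here is a rough sketch in Python. It is
not the bzrlib code: SketchIndex, read(), the range cache and the request log
are all hypothetical names made up for illustration, and the "transport" is
just an in-memory bytes object. Only the two rules come from the description
above: buffer everything when the whole file is requested, and buffer
everything when a real read is needed after more than 50% has been read.

```python
class SketchIndex:
    """Toy stand-in for an index read over a transport (hypothetical)."""

    def __init__(self, data, threshold=0.5):
        self._data = data
        self._size = len(data)
        self._threshold = threshold   # fraction of the file, 0.5 == 50%
        self._bytes_read = 0
        self._cache = {}              # (offset, length) -> bytes already fetched
        self._buffered_all = False
        self.requests = []            # log of simulated round trips

    def _buffer_all(self):
        # One request for the whole file; nothing needs fetching afterwards.
        self.requests.append(("get_all", self._size))
        self._bytes_read = self._size
        self._buffered_all = True

    def read(self, ranges):
        """Return the bytes for each (offset, length) tuple in ranges."""
        if not self._buffered_all:
            missing = [r for r in ranges if r not in self._cache]
            # The threshold is only consulted when a real read is needed.
            if missing:
                wanted = sum(length for _, length in missing)
                if (wanted >= self._size
                        or self._bytes_read > self._size * self._threshold):
                    self._buffer_all()
                else:
                    self.requests.append(("readv", wanted))
                    for offset, length in missing:
                        chunk = self._data[offset:offset + length]
                        self._cache[(offset, length)] = chunk
                    self._bytes_read += wanted
        if self._buffered_all:
            return [self._data[o:o + n] for o, n in ranges]
        return [self._cache[(o, n)] for o, n in ranges]


index = SketchIndex(bytes(range(100)))
index.read([(0, 30)])    # plain readv, 30% read so far
index.read([(30, 30)])   # plain readv, 60% read so far
index.read([(0, 30)])    # already cached: no request despite being past 50%
index.read([(70, 10)])   # needs new data while past 50%: fetch the whole file
```

The third call is the case the old iter_entries check got wrong: it would
have issued the extra GET even though the data was already in hand.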

I think, in general, this patch will now have a relatively minor impact when
you are accessing a small set of data, and will be beneficial when you access
a large amount.

As for tuning 10%, 25%, 50%, 75%, etc., I think we need a fairly large set of
tests to get the optimal value. You are correct that we shouldn't over-tune
for 'bzr branch' and impact 'bzr log' times negatively. I *do* think that 50%
is a safe value for now, and I think we should focus on other things before
tweaking this down to the final XX%.

bzr annotate http://bazaar-vcs.org/bzr/bzr.dev/bzrlib.index.py is
39s versus 44s (10% loss)
bzr log --short -r -10..-1 http://bazaar-vcs.org... is
16.5s versus 16.7s (noise)
bzr log http://bazaar-vcs.org is:
1m56s versus 1m40s (16% gain)

There will probably be a couple mid-sized indexes that we don't need to
completely read for "local-ish" operations. However, we trade that off with
capping our upper limit of reads when we have large operations. I also think
there will always be edge cases where one algorithm shows better results than
the other.

I think one other major factor is the size of the preferred read versus the
average size of the index, along with the fact that when padding the readv
requests, we actually add bytes to the end, rather than "centering" the
request. Because we are bisecting, this causes a bit of asymmetry in the
reads. Anyway, I'd certainly like to be done with this portion, and focus back
on btree indexes as the way forward.
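
The padding asymmetry can be illustrated with a small sketch. Both helpers
below are hypothetical (they are not the bzrlib transport code), and the
400-byte "page" is an arbitrary stand-in for the preferred read size. They
only show the difference between growing a request past its end and growing
it evenly on both sides.

```python
def pad_at_end(offset, length, page, file_size):
    """Grow a request to at least `page` bytes by extending past its end."""
    end = min(max(offset + length, offset + page), file_size)
    return offset, end - offset


def pad_centered(offset, length, page, file_size):
    """Grow a request to at least `page` bytes, split across both sides."""
    extra = max(page - length, 0)
    start = max(offset - extra // 2, 0)
    end = min(start + max(length, page), file_size)
    return start, end - start


# A 100-byte request at offset 1000 in a 10000-byte file, 400-byte page:
print(pad_at_end(1000, 100, 400, 10000))    # covers bytes 1000..1400
print(pad_centered(1000, 100, 400, 10000))  # covers bytes 850..1250
```

With end padding, a bisection step that lands just before the wanted key
gets the following bytes for free, while one that lands just after it gets
nothing useful from the padding.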

I'll merge this refined version with the 50% auto-buffer, and we can tweak it
in the future.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIvYIFJdeBCYSNAAMRAke0AKCSLeo9KpfK9VDM0kOA1SXcZAYgDwCgxKKG
Ivm/WePDZTV9LWO/ASfcBYA=
=v560
-----END PGP SIGNATURE-----