[MERGE] Reading pack-names triggers 3 reads

Tue Nov 11 09:59:59 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Tue, 2008-11-11 at 12:31 +1100, Robert Collins wrote:
> 
>> Largely performance testing; I like to dig into indices cheaply to see
>> what the internal structure etc is; and having to dig up the size first
>> to avoid thrashing the machine would be a nuisance - and like I said, it
>> makes the API less predictable.
> 
> (do note - clearly I've managed to avoid needing this or I would have
> tickled that bug. I think my primary objection is that for a
> memory-capped index, reading arbitrary size files without specific
> instruction, in one hit, is naughty at best).
> 
> -Rob

So, I'm trying to come up with something that fits your criteria, and
fixes both the read-three-times bug and the "I really do need the
correct size" bug.

If I wrote an api that did "read at most PAGE_SIZE bytes", the btree
would still not work, because you always need to know the total size of
the file (or you will fail when you read the past page). I suppose you
could add "readv(allow_last_request_to_be_short=True)", and then always
use that and get rid of the known size entirely. (not really recommended.)

A few options:

1) As I've proposed. Passing size=None is an implicit "read the whole
file" flag.

2) Add a second flag to make that more explicit. However, it would mean
that constructing a BTreeIndex() without either passing size or passing
read_whole_file=True should really raise an exception in the
constructor, because we know in advance that we can't work properly
later on.

3) Use some sort of stat() call, and implement some sort of stat() for
HTTP. This still causes us to issue 2 round-trip requests in order to
read pack-names, which is bad. Though I suppose the code could be:

if not self.read_whole_file:
  st = self._transport.stat(self.name) # this may not be implemented
  self._size = st.st_size
else:
  bytes = self._transport.get_bytes(self.name)
  self._size = len(bytes)
  ...

And then for the Repository code, we could make sure the
"read_whole_file" is set to True for pack-names.

Then your default instantiation of BTree would be memory capped by
default, but fail on HTTP files.

I'm not particularly fond of having something like BTreeIndex fail based
on transport capabilities, but maybe that works best for your use case?
(I'm also not particularly fond of writing a workaround in bzrlib for an
external use case.)
I'm also not sure we handle the self._size is None and self._file is not
None case correctly. It doesn't happen in bzrlib, though.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkZV54ACgkQJdeBCYSNAANCowCgvPgTGDf6UizyKcdl7kbUqrtu
2owAoIWF/WxwsWSipl7zvPNAQCkpzuDM
=slcN
-----END PGP SIGNATURE-----