[MERGE] Reading pack-names triggers 3 reads

John Arbash Meinel john at arbash-meinel.com
Mon Nov 10 22:05:16 GMT 2008



Robert Collins wrote:
> On Sun, 2008-11-09 at 16:27 -0600, John Arbash Meinel wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> John Arbash Meinel wrote:
>>> While I was at it, I also found that 'BTree.iter_all_entries()' on the
>>> pack-names file causes it to be read 3 times.
>>>
>>> 1) iter_all_entries() calls '.key_count()', which reads the root node
>>> 2) It then proceeds to call _read_nodes() for all "needed" nodes, but
>>> this includes the root node.
>>> 3) When _read_nodes() doesn't know the size of the index, it would
>>> trigger a read of the index just to get its size, throw away the
>>> bytes, and then read them again in the next call.
>>>
>>> This patch adds both fixes, and finally drops the number of reads of
>>> "pack-names" to 1 as it should be.
>>>
>> Here is a more appropriate fix that just changes reading pack-names
>> from 3 times down to one. It includes tests, etc. This path really only
>> happens for 'pack-names' because it is the only file whose size we
>> don't know ahead of time.
> 
> So, I've read the patch and it looks generally ok. I had a few concerns:
>  - doesn't the page cache handle a request for the root? (e.g. is the
> cache not working, or is _read_nodes below that layer?)
> 

_read_nodes is below that layer
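
To show what I mean, the layering is roughly like this (an illustrative
sketch with made-up names, not the real bzrlib code):

    _PAGE_SIZE = 4096

    class _LayeringSketch(object):

        def __init__(self, transport, name):
            self._transport = transport
            self._name = name
            self._node_cache = {}    # page offset -> raw page bytes

        def _get_nodes(self, offsets):
            # Cache layer: only misses get handed down to _read_nodes.
            missing = [o for o in offsets if o not in self._node_cache]
            for offset, data in self._read_nodes(missing):
                self._node_cache[offset] = data
            return [(o, self._node_cache[o]) for o in offsets]

        def _read_nodes(self, offsets):
            # Transport layer: every offset that reaches here turns into
            # a real read, regardless of what the caller had cached.
            for offset, data in self._transport.readv(
                    self._name, [(o, _PAGE_SIZE) for o in offsets]):
                yield offset, data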

>  - I don't like the 'read full if we don't know size' change - I often
>    work with bzrlib on big indices - a hundred-plus megabytes - and
>    this makes the API stop being memory-capped; it becomes easier to
>    get wrong. I appreciate the need to reduce round trips here, but I
>    think simply changing it like you have will harm things. I'd like to
>    see it require an explicit parameter or setting rather than be
>    unconditional.

Well, the old code was wrong whenever the index was larger than one
page: it read at most a single page, and then set the size of the entire
index to the size of that read.

We only run into this if self._size is None, and that only happens for
the pack-names file.

If you have other use cases for "self._size = None", I'm happy to
accommodate them. But the existing code was wrong, and doing
"transport.get()" over HTTP downloads the whole file anyway. So I figured
it was far better to actually use get_bytes() and make use of the data.
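
To be concrete, the unknown-size path would become roughly this (a
sketch, not the literal patch):

    if index == 0 and self._size is None:
        # Size unknown (this only happens for pack-names), so fetch the
        # whole file once and remember its real length, instead of
        # guessing it from a single page.
        bytes = self._transport.get_bytes(self._name)
        self._size = len(bytes)
        # ... then carve 'bytes' into _PAGE_SIZE pages from here, rather
        # than issuing a second read for the same data.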

Again, I'm open to suggestions.

> 
>> There was also a subtle bug in the code if pack-names ever became larger
>> than a single page (which is really unlikely, but theoretically possible).
> 
> I'm not sure where that bug was - could you enlarge on it?
> 
> bb:resubmit
> 
> -Rob

To be more explicit, the old code did:

if index == 0:
    # Root node - special case
    if self._size:
        size = min(_PAGE_SIZE, self._size)
    else:
        stream = self._transport.get(self._name)
        start = stream.read(_PAGE_SIZE)
        # Avoid doing this again
        self._size = len(start)
        size = min(_PAGE_SIZE, self._size)

The important bits are:

        stream = self._transport.get(self._name)
        ^- Get a file-like object. For Local/SFTP this is just a file
           handle; for HTTP/FTP it downloads the whole file into a
           StringIO.

        start = stream.read(_PAGE_SIZE)
        ^- Read at most _PAGE_SIZE bytes.

        # Avoid doing this again
        self._size = len(start)
        ^- Set the size of the index to the length of the bytes just
           read. If the file is <= 1 page this is correct; otherwise it
           claims the whole file is 1 page long (because the read above
           was capped at _PAGE_SIZE).

        size = min(_PAGE_SIZE, self._size)
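
So if _PAGE_SIZE is 4096 and pack-names happened to be, say, 10000 bytes:

        start = stream.read(_PAGE_SIZE)   # only 4096 bytes come back
        self._size = len(start)           # self._size is now 4096, not 10000
        # everything after this believes the index is exactly one page,
        # so the second and third pages would never be read correctly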

John
=:->



