[MERGE] Reading pack-names triggers 3 reads
John Arbash Meinel
john at arbash-meinel.com
Mon Nov 10 22:05:16 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Robert Collins wrote:
> On Sun, 2008-11-09 at 16:27 -0600, John Arbash Meinel wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> John Arbash Meinel wrote:
>>> While I was at it, I also found that 'BTree.iter_all_entries()' on the
>>> pack-names file, causes it to be read 3 times.
>>>
>>> 1) iter_all_entries() calls '.key_count()' which reads the root node
>>> 2) It then proceeds to call _read_nodes() for all "needed" nodes, but
>>> this includes the root node.
>>> 3) _read_nodes() when it doesn't know the size of the index, would
>>> trigger a read of the index, just to get its size, and then throw away
>>> the bytes, just to read it again in the next call.
>>>
>>> This patch adds both fixes, and finally drops the number of reads of
>>> "pack-names" to 1 as it should be.
>>>
>> Here is a more appropriate fix which just changes reading pack-names
>> from 3 times down to one. It includes tests, etc. This path really only
>> happens for 'pack-names' because it is the only file that we don't know
>> the size ahead of time.
>
> So, I've read the patch and it looks generally ok. I had a few concerns:
> - doesn't the page cache handle a request for the root? (e.g. is the
> cache not working, or is _read_nodes below that layer0.
>
_read_nodes is below that layer
> - I don't like the 'read full if we don't know size' change - I often
> work with bzrlib on big indices - hundred + megabytes, and this makes
> the api stop being memory capped - it becomes easier to get wrong.
> I appreciate the need to reduce round trips here, but I think simply
> changing it like you have will harm things. I'd like to see
> it require an explicit parameter or setting rather than be
> unconditional.
Well, the old code was wrong in the case that the number of pages was >
1. It read a single page at most, and set the size of the entire index
to that size.
We only run into this if self._size is None, and that only happens for
the pack-names file.
If you have other use cases for "self._size = None", I'm happy to
accomodate. But the existing code was wrong, and doing "transport.get()"
over HTTP downloads the whole file anyway. So I figured it was far
better to actually get_bytes() and make use of it.
Again, I'm open to suggestions.
>
>> There was also a subtle bug in the code if pack-names ever became larger
>> than a single page (which is really unlikely, but theoretically possible).
>
> I'm not sure where that bug was - could you enlarge on it?
>
> bb:resubmit
>
> -Rob
To be more explicit, the old code did:
if index == 0:
# Root node - special case
if self._size:
size = min(_PAGE_SIZE, self._size)
else:
stream = self._transport.get(self._name)
start = stream.read(_PAGE_SIZE)
# Avoid doing this again
self._size = len(start)
size = min(_PAGE_SIZE, self._size)
The important bits are:
stream = self._transport.get(self._name)
^- Get a file handle, for Local/SFTP this is just a file handle,
for HTTP/FTP this downloads the whole file into a StringIO
start = stream.read(_PAGE_SIZE)
^- Read at most _PAGE_SIZE bytes
# Avoid doing this again
self._size = len(start)
^- set the size of the index to the length of the read bytes
If the file is <= 1 PAGE this is correct, otherwise it
claims that the whole file is 1 PAGE long (due to the read
previously being capped at _PAGE_SIZE)
size = min(_PAGE_SIZE, self._size)
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkkYsBwACgkQJdeBCYSNAAOSDACgkQnzLQpozZYnBy1CT6QRPK/4
OX0An0b13lG6xPRKHxsR3yyXtXJY2+Tq
=xQp2
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list