Rev 3679: (jam) Tweaks to index code. If we readv a whole index, in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Tue Sep 2 19:44:56 BST 2008


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 3679
revision-id: pqm at pqm.ubuntu.com-20080902184447-n1nsxw1wcaumxwkb
parent: pqm at pqm.ubuntu.com-20080902152844-dext0kmx4m0u5szy
parent: john at arbash-meinel.com-20080902175200-nge9qgk0gklkd5ew
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Tue 2008-09-02 19:44:47 +0100
message:
  (jam) Tweaks to index code. If we readv a whole index,
  treat it as a GET, and if we read >50% of an index, read it all.
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
  bzrlib/tests/test_index.py     test_index.py-20070712131115-lolkarso50vjr64s-2
    ------------------------------------------------------------
    revno: 3665.3.5
    revision-id: john at arbash-meinel.com-20080902175200-nge9qgk0gklkd5ew
    parent: john at arbash-meinel.com-20080829185848-svqlofrr394rp9x9
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: graph_index_autobuffer
    timestamp: Tue 2008-09-02 12:52:00 -0500
    message:
      Move the point at which we 'buffer_all' if we've read >50% of the index.
      
      We were doing it as soon as 'iter_entries' was entered, but often we
      already have enough info to return results, and for small, mostly-local
      operations we don't need to buffer everything. (This happens mostly with
      moderate-size indexes, where the first read of the header is enough to
      give you the data you need, but happens to be >50% of the whole file.)
    modified:
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
      bzrlib/tests/test_index.py     test_index.py-20070712131115-lolkarso50vjr64s-2
    ------------------------------------------------------------
    revno: 3665.3.4
    revision-id: john at arbash-meinel.com-20080829185848-svqlofrr394rp9x9
    parent: john at arbash-meinel.com-20080829173847-n3h7gwv6hep9glbn
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: graph_index_autobuffer
    timestamp: Fri 2008-08-29 13:58:48 -0500
    message:
      Update a code comment.
    modified:
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
    ------------------------------------------------------------
    revno: 3665.3.3
    revision-id: john at arbash-meinel.com-20080829173847-n3h7gwv6hep9glbn
    parent: john at arbash-meinel.com-20080829171504-p99qggtlhhvmmzzj
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: graph_index_autobuffer
    timestamp: Fri 2008-08-29 12:38:47 -0500
    message:
      If we read more than 50% of the whole index,
      go ahead and buffer the whole thing on the next request.
      The threshold could be tuned (30%? 75%?), but the old code could easily
      end up reading more than 1x the total bytes of the file.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
      bzrlib/tests/test_index.py     test_index.py-20070712131115-lolkarso50vjr64s-2
    ------------------------------------------------------------
    revno: 3665.3.2
    revision-id: john at arbash-meinel.com-20080829171504-p99qggtlhhvmmzzj
    parent: john at arbash-meinel.com-20080829171330-mrui49146o15ak98
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: graph_index_autobuffer
    timestamp: Fri 2008-08-29 12:15:04 -0500
    message:
      NEWS
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
    ------------------------------------------------------------
    revno: 3665.3.1
    revision-id: john at arbash-meinel.com-20080829171330-mrui49146o15ak98
    parent: pqm at pqm.ubuntu.com-20080829062746-ny482m2f2pukdhqt
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: graph_index_autobuffer
    timestamp: Fri 2008-08-29 12:13:30 -0500
    message:
      Updates to GraphIndex processing.
      
      If _read_and_parse() ends up reading the whole index, pass that
      on to _buffer_all() rather than doing the normal bisect code.
      This happens a lot when reading indexes over remote transports
      which read a 64kB page at a time. The old code would end up
      reading the whole file, parsing it into bisect regions, and
      then reading the whole file again in the next call.
      
      This also includes refactoring of the graph index test suite
      to pull out a lot of redundant code into helper functions.
    modified:
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
      bzrlib/tests/test_index.py     test_index.py-20070712131115-lolkarso50vjr64s-2
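
Taken together, the messages above boil down to two checks layered on top of
the existing bisection reads: buffer the whole index when a single readv
happens to return the entire file, and buffer it on the next request once half
or more of the file has already been read. The following is a minimal,
hypothetical sketch of that decision logic only, not the bzrlib code (the real
changes are in the bzrlib/index.py hunks below); the attribute names here just
loosely mirror the patch:

    class BufferingReader(object):
        """Toy model of the two GraphIndex read heuristics in this patch."""

        def __init__(self, data):
            self._data = data            # stands in for the index file's bytes
            self._size = len(data)
            self._bytes_read = 0
            self._buffered = None        # stands in for GraphIndex._nodes

        def _buffer_all(self):
            # In bzrlib this parses the whole index; here we just keep the bytes.
            if self._buffered is None:
                self._buffered = self._data

        def read_ranges(self, ranges):
            """Read (offset, length) ranges, applying both heuristics."""
            if not ranges:
                return []
            # Heuristic 2: once earlier requests have covered 50% or more of
            # the file, stop bisecting and read the whole thing exactly once.
            if self._buffered is None and self._bytes_read * 2 >= self._size:
                self._buffer_all()
                return []
            results = []
            for offset, length in ranges:
                chunk = self._data[offset:offset + length]
                self._bytes_read += len(chunk)
                # Heuristic 1: a single read that happened to span the whole
                # file is treated as a full GET and buffered immediately.
                if offset == 0 and len(chunk) == self._size:
                    self._buffer_all()
                    return []
                results.append((offset, chunk))
            return results

    reader = BufferingReader(b'x' * 100)
    reader.read_ranges([(0, 40)])     # 40% read so far: keep bisecting
    assert reader._buffered is None
    reader.read_ranges([(60, 20)])    # now 60% read in total, still bisecting...
    assert reader._buffered is None
    reader.read_ranges([(80, 10)])    # ...but the next request buffers it all
    assert reader._buffered is not None
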
=== modified file 'NEWS'
--- a/NEWS	2008-09-02 15:28:44 +0000
+++ b/NEWS	2008-09-02 18:44:47 +0000
@@ -35,6 +35,17 @@
       to recover if you realize you uncommitted the wrong thing.
       (John Arbash Meinel)
 
+    * When reading index files, if we happen to read the whole file in a
+      single request treat it as a ``_buffer_all`` request. This happens
+      most often on small indexes over remote transports, where we default
+      to reading 64kB. It saves a round trip for each small index during
+      fetch operations. Also, if we have read more than 50% of an index
+      file, trigger a ``_buffer_all`` on the next request. This works
+      around some inefficiencies because reads don't fall neatly on page
+      boundaries, so we would ignore those bytes, but request them again
+      later. This could trigger a total read size of more than the whole
+      file. (John Arbash Meinel)
+
   BUG FIXES:
 
     * ``bzr rm`` is now aliased to ``bzr del`` for the convenience of svn
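
To make the inefficiency described in that NEWS entry concrete: each bisection
probe is expanded to a full readv page, and bytes trimmed off at page
boundaries get requested again by later probes, so the total bytes transferred
can exceed the size of the index itself. A rough, hypothetical
back-of-the-envelope illustration (the figures are made up, not measured):

    # Made-up figures: a 10,000 byte index read through 4,096 byte readv pages.
    index_size = 10000
    page = 4096
    probes = [(0, page),       # header probe
              (3000, page),    # first bisection probe
              (2500, page)]    # later probe re-requests already-fetched bytes
    total = sum(length for _, length in probes)
    print("%d bytes requested for a %d byte index" % (total, index_size))
    assert total > index_size  # 12,288 > 10,000: more than 1x the file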

=== modified file 'bzrlib/index.py'
--- a/bzrlib/index.py	2008-08-14 19:22:59 +0000
+++ b/bzrlib/index.py	2008-09-02 17:52:00 +0000
@@ -272,6 +272,8 @@
         self._keys_by_offset = None
         self._nodes_by_key = None
         self._size = size
+        # The number of bytes we've read so far in trying to process this file
+        self._bytes_read = 0
 
     def __eq__(self, other):
         """Equal when self and other were created with the same parameters."""
@@ -288,14 +290,18 @@
         return "%s(%r)" % (self.__class__.__name__,
             self._transport.abspath(self._name))
 
-    def _buffer_all(self):
+    def _buffer_all(self, stream=None):
         """Buffer all the index data.
 
         Mutates self._nodes and self.keys_by_offset.
         """
+        if self._nodes is not None:
+            # We already did this
+            return
         if 'index' in debug.debug_flags:
             mutter('Reading entire index %s', self._transport.abspath(self._name))
-        stream = self._transport.get(self._name)
+        if stream is None:
+            stream = self._transport.get(self._name)
         self._read_prefix(stream)
         self._expected_elements = 3 + self._key_length
         line_count = 0
@@ -473,6 +479,7 @@
             return []
         if self._size is None and self._nodes is None:
             self._buffer_all()
+
         # We fit about 20 keys per minimum-read (4K), so if we are looking for
         # more than 1/20th of the index its likely (assuming homogenous key
         # spread) that we'll read the entire index. If we're going to do that,
@@ -629,10 +636,24 @@
         if self._bisect_nodes is None:
             readv_ranges.append(_HEADER_READV)
         self._read_and_parse(readv_ranges)
+        result = []
+        if self._nodes is not None:
+            # _read_and_parse triggered a _buffer_all because we requested the
+            # whole data range
+            for location, key in location_keys:
+                if key not in self._nodes: # not present
+                    result.append(((location, key), False))
+                elif self.node_ref_lists:
+                    value, refs = self._nodes[key]
+                    result.append(((location, key),
+                        (self, key, value, refs)))
+                else:
+                    result.append(((location, key),
+                        (self, key, self._nodes[key])))
+            return result
         # generate results:
         #  - figure out <, >, missing, present
         #  - result present references so we can return them.
-        result = []
         # keys that we cannot answer until we resolve references
         pending_references = []
         pending_locations = set()
@@ -688,9 +709,15 @@
             if length > 0:
                 readv_ranges.append((location, length))
         self._read_and_parse(readv_ranges)
+        if self._nodes is not None:
+            # The _read_and_parse triggered a _buffer_all, grab the data and
+            # return it
+            for location, key in pending_references:
+                value, refs = self._nodes[key]
+                result.append(((location, key), (self, key, value, refs)))
+            return result
         for location, key in pending_references:
             # answer key references we had to look-up-late.
-            index = self._parsed_key_index(key)
             value, refs = self._bisect_nodes[key]
             result.append(((location, key), (self, key,
                 value, self._resolve_references(refs))))
@@ -966,18 +993,32 @@
 
         :param readv_ranges: A prepared readv range list.
         """
-        if readv_ranges:
-            readv_data = self._transport.readv(self._name, readv_ranges, True,
-                self._size)
-            # parse
-            for offset, data in readv_data:
-                if self._bisect_nodes is None:
-                    # this must be the start
-                    if not (offset == 0):
-                        raise AssertionError()
-                    offset, data = self._parse_header_from_bytes(data)
-                # print readv_ranges, "[%d:%d]" % (offset, offset + len(data))
-                self._parse_region(offset, data)
+        if not readv_ranges:
+            return
+        if self._nodes is None and self._bytes_read * 2 >= self._size:
+            # We've already read more than 50% of the file and we are about to
+            # request more data, just _buffer_all() and be done
+            self._buffer_all()
+            return
+
+        readv_data = self._transport.readv(self._name, readv_ranges, True,
+            self._size)
+        # parse
+        for offset, data in readv_data:
+            self._bytes_read += len(data)
+            if offset == 0 and len(data) == self._size:
+                # We read the whole range, most likely because the
+                # Transport upcast our readv ranges into one long request
+                # for enough total data to grab the whole index.
+                self._buffer_all(StringIO(data))
+                return
+            if self._bisect_nodes is None:
+                # this must be the start
+                if not (offset == 0):
+                    raise AssertionError()
+                offset, data = self._parse_header_from_bytes(data)
+            # print readv_ranges, "[%d:%d]" % (offset, offset + len(data))
+            self._parse_region(offset, data)
 
     def _signature(self):
         """The file signature for this index type."""

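One design choice worth noting in the hunk above: ``_buffer_all`` now takes an
optional ``stream``, so when ``_read_and_parse`` discovers it has already
downloaded the entire file it wraps those bytes in a ``StringIO`` and parses
them directly rather than issuing a second full GET. A small, self-contained
sketch of the same reuse pattern, with hypothetical helper names rather than
the bzrlib API:

    from io import BytesIO

    def get(name):
        """Stand-in for transport.get(): pretend it costs a network round trip."""
        print("GET %s" % name)
        return BytesIO(b"header\nrow1\nrow2\n")

    def buffer_all(name, stream=None):
        """Parse the whole index, reusing an already-downloaded stream if given."""
        if stream is None:
            stream = get(name)       # only pay for the round trip when needed
        return stream.read().splitlines()

    data = get("index").read()       # a readv that happened to cover the file
    rows = buffer_all("index", stream=BytesIO(data))  # no second GET issued
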
=== modified file 'bzrlib/tests/test_index.py'
--- a/bzrlib/tests/test_index.py	2007-10-15 07:56:04 +0000
+++ b/bzrlib/tests/test_index.py	2008-09-02 17:52:00 +0000
@@ -353,6 +353,20 @@
 
 class TestGraphIndex(TestCaseWithMemoryTransport):
 
+    def make_key(self, number):
+        return (str(number) + 'X'*100,)
+
+    def make_value(self, number):
+            return str(number) + 'Y'*100
+
+    def make_nodes(self, count=64):
+        # generate a big enough index that we only read some of it on a typical
+        # bisection lookup.
+        nodes = []
+        for counter in range(count):
+            nodes.append((self.make_key(counter), self.make_value(counter), ()))
+        return nodes
+
     def make_index(self, ref_lists=0, key_elements=1, nodes=[]):
         builder = GraphIndexBuilder(ref_lists, key_elements=key_elements)
         for key, value, references in nodes:
@@ -372,62 +386,80 @@
         self.assertEqual([], index._parsed_byte_map)
         self.assertEqual([], index._parsed_key_map)
 
+    def test_key_count_buffers(self):
+        index = self.make_index(nodes=self.make_nodes(2))
+        # reset the transport log
+        del index._transport._activity[:]
+        self.assertEqual(2, index.key_count())
+        # We should have requested reading the header bytes
+        self.assertEqual([
+            ('readv', 'index', [(0, 200)], True, index._size),
+            ],
+            index._transport._activity)
+        # And that should have been enough to trigger reading the whole index
+        # with buffering
+        self.assertIsNot(None, index._nodes)
+
+    def test_lookup_key_via_location_buffers(self):
+        index = self.make_index()
+        # reset the transport log
+        del index._transport._activity[:]
+        # do a _lookup_keys_via_location call for the middle of the file, which
+        # is what bisection uses.
+        result = index._lookup_keys_via_location(
+            [(index._size // 2, ('missing', ))])
+        # this should have asked for a readv request, with adjust_for_latency,
+        # and two regions: the header, and half-way into the file.
+        self.assertEqual([
+            ('readv', 'index', [(30, 30), (0, 200)], True, 60),
+            ],
+            index._transport._activity)
+        # and the result should be that the key cannot be present, because this
+        # is a trivial index.
+        self.assertEqual([((index._size // 2, ('missing', )), False)],
+            result)
+        # And this should have caused the file to be fully buffered
+        self.assertIsNot(None, index._nodes)
+        self.assertEqual([], index._parsed_byte_map)
+
     def test_first_lookup_key_via_location(self):
-        index = self.make_index()
+        # We need enough data so that the _HEADER_READV doesn't consume the
+        # whole file. We always read 800 bytes for every key, and the local
+        # transport natural expansion is 4096 bytes. So we have to have >8192
+        # bytes or we will trigger "buffer_all".
+        # We also want the 'missing' key to fall within the range that *did*
+        # read
+        nodes = []
+        index = self.make_index(nodes=self.make_nodes(64))
         # reset the transport log
         del index._transport._activity[:]
         # do a _lookup_keys_via_location call for the middle of the file, which
         # is what bisection uses.
+        start_lookup = index._size // 2
         result = index._lookup_keys_via_location(
-            [(index._size // 2, ('missing', ))])
+            [(start_lookup, ('40missing', ))])
         # this should have asked for a readv request, with adjust_for_latency,
         # and two regions: the header, and half-way into the file.
         self.assertEqual([
-            ('readv', 'index', [(30, 30), (0, 200)], True, 60),
+            ('readv', 'index',
+             [(start_lookup, 800), (0, 200)], True, index._size),
             ],
             index._transport._activity)
         # and the result should be that the key cannot be present, because this
         # is a trivial index.
-        self.assertEqual([((index._size // 2, ('missing', )), False)],
-            result)
-        # And the regions of the file that have been parsed - in this case the
-        # entire file - should be in the parsed region map.
-        self.assertEqual([(0, 60)], index._parsed_byte_map)
-        self.assertEqual([(None, None)], index._parsed_key_map)
-
-    def test_parsing_parses_data_adjacent_to_parsed_regions(self):
-        # we trim data we recieve to remove the first and trailing
-        # partial lines, except when they start at the end/finish at the start
-        # of a region we've alread parsed/ the end of the file. The trivial
-        # test for this is an index with 1 key.
-        index = self.make_index(nodes=[(('name', ), 'data', ())])
-        # reset the transport log
-        del index._transport._activity[:]
-        result = index._lookup_keys_via_location(
-            [(index._size // 2, ('missing', ))])
-        # this should have asked for a readv request, with adjust_for_latency,
-        # and two regions: the header, and half-way into the file.
-        self.assertEqual([
-            ('readv', 'index', [(36, 36), (0, 200)], True, 72),
-            ],
-            index._transport._activity)
-        # and the result should be that the key cannot be present, because this
-        # is a trivial index and we should not have to do more round trips.
-        self.assertEqual([((index._size // 2, ('missing', )), False)],
-            result)
-        # The whole file should be parsed at this point.
-        self.assertEqual([(0, 72)], index._parsed_byte_map)
-        self.assertEqual([(None, ('name',))], index._parsed_key_map)
+        self.assertEqual([((start_lookup, ('40missing', )), False)],
+            result)
+        # And this should not have caused the file to be fully buffered
+        self.assertIs(None, index._nodes)
+        # And the regions of the file that have been parsed should be in the
+        # parsed_byte_map and the parsed_key_map
+        self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(26)),
+                          (self.make_key(31), self.make_key(48))],
+                         index._parsed_key_map)
 
     def test_parsing_non_adjacent_data_trims(self):
-        # generate a big enough index that we only read some of it on a typical
-        # bisection lookup.
-        nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        for counter in range(64):
-            nodes.append((make_key(counter), 'Y'*100, ()))
-        index = self.make_index(nodes=nodes)
+        index = self.make_index(nodes=self.make_nodes(64))
         result = index._lookup_keys_via_location(
             [(index._size // 2, ('40', ))])
         # and the result should be that the key cannot be present, because key is
@@ -437,8 +469,9 @@
             result)
         # and we should have a parse map that includes the header and the
         # region that was parsed after trimming.
-        self.assertEqual([(0, 3972), (5001, 8914)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(26)), (make_key(31), make_key(48))],
+        self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(26)),
+                          (self.make_key(31), self.make_key(48))],
             index._parsed_key_map)
 
     def test_parsing_data_handles_parsed_contained_regions(self):
@@ -448,53 +481,45 @@
         # which then trims the start and end so the parsed size is < readv
         # miniumum.
         # then a dual lookup (or a reference lookup for that matter) which
-        # abuts or overlaps the parsed region on both sides will need to 
+        # abuts or overlaps the parsed region on both sides will need to
         # discard the data in the middle, but parse the end as well.
         #
-        # we test this by doing a single lookup to seed the data, then 
-        # a lookup for two keys that are present, and adjacent - 
+        # we test this by doing a single lookup to seed the data, then
+        # a lookup for two keys that are present, and adjacent -
         # we except both to be found, and the parsed byte map to include the
         # locations of both keys.
-        nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        def make_value(number):
-            return 'Y'*100
-        for counter in range(128):
-            nodes.append((make_key(counter), make_value(counter), ()))
-        index = self.make_index(nodes=nodes)
+        index = self.make_index(nodes=self.make_nodes(128))
         result = index._lookup_keys_via_location(
             [(index._size // 2, ('40', ))])
         # and we should have a parse map that includes the header and the
         # region that was parsed after trimming.
-        self.assertEqual([(0, 3991), (11622, 15534)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(116)), (make_key(35), make_key(51))],
+        self.assertEqual([(0, 4045), (11759, 15707)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(116)),
+                          (self.make_key(35), self.make_key(51))],
             index._parsed_key_map)
         # now ask for two keys, right before and after the parsed region
         result = index._lookup_keys_via_location(
-            [(11450, make_key(34)), (15534, make_key(52))])
+            [(11450, self.make_key(34)), (15707, self.make_key(52))])
         self.assertEqual([
-            ((11450, make_key(34)), (index, make_key(34), make_value(34))),
-            ((15534, make_key(52)), (index, make_key(52), make_value(52))),
+            ((11450, self.make_key(34)),
+             (index, self.make_key(34), self.make_value(34))),
+            ((15707, self.make_key(52)),
+             (index, self.make_key(52), self.make_value(52))),
             ],
             result)
-        self.assertEqual([(0, 3991), (9975, 17799)], index._parsed_byte_map)
+        self.assertEqual([(0, 4045), (9889, 17993)], index._parsed_byte_map)
 
     def test_lookup_missing_key_answers_without_io_when_map_permits(self):
         # generate a big enough index that we only read some of it on a typical
         # bisection lookup.
-        nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        for counter in range(64):
-            nodes.append((make_key(counter), 'Y'*100, ()))
-        index = self.make_index(nodes=nodes)
+        index = self.make_index(nodes=self.make_nodes(64))
         # lookup the keys in the middle of the file
         result =index._lookup_keys_via_location(
             [(index._size // 2, ('40', ))])
         # check the parse map, this determines the test validity
-        self.assertEqual([(0, 3972), (5001, 8914)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(26)), (make_key(31), make_key(48))],
+        self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(26)),
+                          (self.make_key(31), self.make_key(48))],
             index._parsed_key_map)
         # reset the transport log
         del index._transport._activity[:]
@@ -502,7 +527,6 @@
         # not create a new transport request, and should return False (cannot
         # be in the index) - even when the byte location we ask for is outside
         # the parsed region
-        # 
         result = index._lookup_keys_via_location(
             [(4000, ('40', ))])
         self.assertEqual([((4000, ('40', )), False)],
@@ -512,20 +536,14 @@
     def test_lookup_present_key_answers_without_io_when_map_permits(self):
         # generate a big enough index that we only read some of it on a typical
         # bisection lookup.
-        nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        def make_value(number):
-            return str(number) + 'Y'*100
-        for counter in range(64):
-            nodes.append((make_key(counter), make_value(counter), ()))
-        index = self.make_index(nodes=nodes)
+        index = self.make_index(nodes=self.make_nodes(64))
         # lookup the keys in the middle of the file
         result =index._lookup_keys_via_location(
             [(index._size // 2, ('40', ))])
         # check the parse map, this determines the test validity
         self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(26)), (make_key(31), make_key(48))],
+        self.assertEqual([(None, self.make_key(26)),
+                          (self.make_key(31), self.make_key(48))],
             index._parsed_key_map)
         # reset the transport log
         del index._transport._activity[:]
@@ -534,28 +552,25 @@
         # be in the index) - even when the byte location we ask for is outside
         # the parsed region
         # 
-        result = index._lookup_keys_via_location([(4000, make_key(40))])
+        result = index._lookup_keys_via_location([(4000, self.make_key(40))])
         self.assertEqual(
-            [((4000, make_key(40)), (index, make_key(40), make_value(40)))],
+            [((4000, self.make_key(40)),
+              (index, self.make_key(40), self.make_value(40)))],
             result)
         self.assertEqual([], index._transport._activity)
 
     def test_lookup_key_below_probed_area(self):
         # generate a big enough index that we only read some of it on a typical
         # bisection lookup.
-        nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        for counter in range(64):
-            nodes.append((make_key(counter), 'Y'*100, ()))
-        index = self.make_index(nodes=nodes)
+        index = self.make_index(nodes=self.make_nodes(64))
         # ask for the key in the middle, but a key that is located in the
         # unparsed region before the middle.
         result =index._lookup_keys_via_location(
             [(index._size // 2, ('30', ))])
         # check the parse map, this determines the test validity
-        self.assertEqual([(0, 3972), (5001, 8914)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(26)), (make_key(31), make_key(48))],
+        self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(26)),
+                          (self.make_key(31), self.make_key(48))],
             index._parsed_key_map)
         self.assertEqual([((index._size // 2, ('30', )), -1)],
             result)
@@ -563,19 +578,15 @@
     def test_lookup_key_above_probed_area(self):
         # generate a big enough index that we only read some of it on a typical
         # bisection lookup.
-        nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        for counter in range(64):
-            nodes.append((make_key(counter), 'Y'*100, ()))
-        index = self.make_index(nodes=nodes)
+        index = self.make_index(nodes=self.make_nodes(64))
         # ask for the key in the middle, but a key that is located in the
         # unparsed region after the middle.
         result =index._lookup_keys_via_location(
             [(index._size // 2, ('50', ))])
         # check the parse map, this determines the test validity
-        self.assertEqual([(0, 3972), (5001, 8914)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(26)), (make_key(31), make_key(48))],
+        self.assertEqual([(0, 4008), (5046, 8996)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(26)),
+                          (self.make_key(31), self.make_key(48))],
             index._parsed_key_map)
         self.assertEqual([((index._size // 2, ('50', )), +1)],
             result)
@@ -584,38 +595,75 @@
         # generate a big enough index that we only read some of it on a typical
         # bisection lookup.
         nodes = []
-        def make_key(number):
-            return (str(number) + 'X'*100,)
-        def make_value(number):
-            return str(number) + 'Y'*100
+        for counter in range(99):
+            nodes.append((self.make_key(counter), self.make_value(counter),
+                ((self.make_key(counter + 20),),)  ))
+        index = self.make_index(ref_lists=1, nodes=nodes)
+        # lookup a key in the middle that does not exist, so that when we can
+        # check that the referred-to-keys are not accessed automatically.
+        index_size = index._size
+        index_center = index_size // 2
+        result = index._lookup_keys_via_location(
+            [(index_center, ('40', ))])
+        # check the parse map - only the start and middle should have been
+        # parsed.
+        self.assertEqual([(0, 4027), (10198, 14028)], index._parsed_byte_map)
+        self.assertEqual([(None, self.make_key(17)),
+                          (self.make_key(44), self.make_key(5))],
+            index._parsed_key_map)
+        # and check the transport activity likewise.
+        self.assertEqual(
+            [('readv', 'index', [(index_center, 800), (0, 200)], True,
+                                  index_size)],
+            index._transport._activity)
+        # reset the transport log for testing the reference lookup
+        del index._transport._activity[:]
+        # now looking up a key in the portion of the file already parsed should
+        # only perform IO to resolve its key references.
+        result = index._lookup_keys_via_location([(11000, self.make_key(45))])
+        self.assertEqual(
+            [((11000, self.make_key(45)),
+              (index, self.make_key(45), self.make_value(45),
+               ((self.make_key(65),),)))],
+            result)
+        self.assertEqual([('readv', 'index', [(15093, 800)], True, index_size)],
+            index._transport._activity)
+
+    def test_lookup_key_can_buffer_all(self):
+        nodes = []
         for counter in range(64):
-            nodes.append((make_key(counter), make_value(counter),
-                ((make_key(counter + 20),),)  ))
+            nodes.append((self.make_key(counter), self.make_value(counter),
+                ((self.make_key(counter + 20),),)  ))
         index = self.make_index(ref_lists=1, nodes=nodes)
         # lookup a key in the middle that does not exist, so that when we can
         # check that the referred-to-keys are not accessed automatically.
-        result =index._lookup_keys_via_location(
-            [(index._size // 2, ('40', ))])
+        index_size = index._size
+        index_center = index_size // 2
+        result = index._lookup_keys_via_location([(index_center, ('40', ))])
         # check the parse map - only the start and middle should have been
         # parsed.
         self.assertEqual([(0, 3890), (6444, 10274)], index._parsed_byte_map)
-        self.assertEqual([(None, make_key(25)), (make_key(37), make_key(52))],
+        self.assertEqual([(None, self.make_key(25)),
+                          (self.make_key(37), self.make_key(52))],
             index._parsed_key_map)
         # and check the transport activity likewise.
         self.assertEqual(
-            [('readv', 'index', [(7906, 800), (0, 200)], True, 15813)],
+            [('readv', 'index', [(index_center, 800), (0, 200)], True,
+                                  index_size)],
             index._transport._activity)
         # reset the transport log for testing the reference lookup
         del index._transport._activity[:]
         # now looking up a key in the portion of the file already parsed should
         # only perform IO to resolve its key references.
-        result = index._lookup_keys_via_location([(4000, make_key(40))])
+        result = index._lookup_keys_via_location([(7000, self.make_key(40))])
         self.assertEqual(
-            [((4000, make_key(40)),
-              (index, make_key(40), make_value(40), ((make_key(60),),)))],
+            [((7000, self.make_key(40)),
+              (index, self.make_key(40), self.make_value(40),
+               ((self.make_key(60),),)))],
             result)
-        self.assertEqual([('readv', 'index', [(11976, 800)], True, 15813)],
-            index._transport._activity)
+        # Resolving the references would have required more data read, and we
+        # are already above the 50% threshold, so it triggered a _buffer_all
+        self.assertEqual([('get', 'index')], index._transport._activity)
 
     def test_iter_all_entries_empty(self):
         index = self.make_index()
@@ -640,6 +688,48 @@
             (index, ('ref', ), 'refdata', ((), ))]),
             set(index.iter_all_entries()))
 
+    def test_iter_entries_buffers_once(self):
+        index = self.make_index(nodes=self.make_nodes(2))
+        # reset the transport log
+        del index._transport._activity[:]
+        self.assertEqual(set([(index, self.make_key(1), self.make_value(1))]),
+                         set(index.iter_entries([self.make_key(1)])))
+        # We should have requested reading the header bytes
+        # But not needed any more than that because it would have triggered a
+        # buffer all
+        self.assertEqual([
+            ('readv', 'index', [(0, 200)], True, index._size),
+            ],
+            index._transport._activity)
+        # And that should have been enough to trigger reading the whole index
+        # with buffering
+        self.assertIsNot(None, index._nodes)
+
+    def test_iter_entries_buffers_by_bytes_read(self):
+        index = self.make_index(nodes=self.make_nodes(64))
+        list(index.iter_entries([self.make_key(10)]))
+        # The first time through isn't enough to trigger a buffer all
+        self.assertIs(None, index._nodes)
+        self.assertEqual(4096, index._bytes_read)
+        # Grabbing a key in that same page won't trigger a buffer all, as we
+        # still haven't read 50% of the file
+        list(index.iter_entries([self.make_key(11)]))
+        self.assertIs(None, index._nodes)
+        self.assertEqual(4096, index._bytes_read)
+        # We haven't read more data, so reading outside the range won't trigger
+        # a buffer all right away
+        list(index.iter_entries([self.make_key(40)]))
+        self.assertIs(None, index._nodes)
+        self.assertEqual(8192, index._bytes_read)
+        # On the next pass, we will not trigger buffer all if the key is
+        # available without reading more
+        list(index.iter_entries([self.make_key(32)]))
+        self.assertIs(None, index._nodes)
+        # But if we *would* need to read more to resolve it, then we will
+        # buffer all.
+        list(index.iter_entries([self.make_key(60)]))
+        self.assertIsNot(None, index._nodes)
+
     def test_iter_entries_references_resolved(self):
         index = self.make_index(1, nodes=[
             (('name', ), 'data', ([('ref', ), ('ref', )], )),
@@ -763,6 +853,29 @@
             (('name', ), '', ()), (('foo', ), '', ())])
         self.assertEqual(2, index.key_count())
 
+    def test_read_and_parse_tracks_real_read_value(self):
+        index = self.make_index(nodes=self.make_nodes(10))
+        del index._transport._activity[:]
+        index._read_and_parse([(0, 200)])
+        self.assertEqual([
+            ('readv', 'index', [(0, 200)], True, index._size),
+            ],
+            index._transport._activity)
+        # The readv expansion code will expand the initial request to 4096
+        # bytes, which is more than enough to read the entire index, and we
+        # will track the fact that we read that many bytes.
+        self.assertEqual(index._size, index._bytes_read)
+
+    def test_read_and_parse_triggers_buffer_all(self):
+        index = self.make_index(key_elements=2, nodes=[
+            (('name', 'fin1'), 'data', ()),
+            (('name', 'fin2'), 'beta', ()),
+            (('ref', 'erence'), 'refdata', ())])
+        self.assertTrue(index._size > 0)
+        self.assertIs(None, index._nodes)
+        index._read_and_parse([(0, index._size)])
+        self.assertIsNot(None, index._nodes)
+
     def test_validate_bad_index_errors(self):
         trans = self.get_transport()
         trans.put_bytes('name', "not an index\n")



