Rev 3832: Change the _check_remap code to request unread pages in batches. in http://bzr.arbash-meinel.com/branches/bzr/brisbane/hack

John Arbash Meinel john at arbash-meinel.com
Thu Dec 25 15:15:54 GMT 2008


At http://bzr.arbash-meinel.com/branches/bzr/brisbane/hack

------------------------------------------------------------
revno: 3832
revision-id: john at arbash-meinel.com-20081225151532-52w2hxndu05ttr6z
parent: john at arbash-meinel.com-20081224214806-7qax2ikgd12n8ai3
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: hack
timestamp: Thu 2008-12-25 09:15:32 -0600
message:
  Change the _check_remap code to request unread pages in batches.
  
  The theory is that you can probably find out that the children won't fit in
  a single leaf even without reading all of the child nodes.
  Also, increase the page cache size slightly, so that the size should be able to
  hold a single working set.
  
  Initial results seem promising, with less time spent in _check_remap's
  get_record_stream calls.
  
  However, it may have just pushed the index lookups into the 'has_key()' code
  before we add the new lines. Part of the problem is that 'sha' keys are going
  to be randomly distributed, so we will have to page in a lot of the keys.
  
  Further, the chk index is costing a large amount of disk space. At present,
  they are 26MB, versus 9MB for the .tix.
  That makes the total index size 10% of the total repository size, which is
  something to be concerned about.
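The batching described in the message can be sketched as follows. This is a minimal, self-contained illustration rather than the bzrlib code; `fetch_records` and `fits_in_leaf` are hypothetical stand-ins for `store.get_record_stream()` and the `_map_no_split()` check:

```python
def check_remap_in_batches(keys, fetch_records, fits_in_leaf, batch_size=10):
    """Read child pages in small batches, stopping early when possible.

    keys: iterable of page keys to read.
    fetch_records: hypothetical stand-in for store.get_record_stream();
        returns (key, node) pairs for one batch of keys.
    fits_in_leaf: callback saying whether a node's items still fit.
    Returns True if every child fit, False as soon as one does not.
    """
    key_order = list(keys)
    for start in range(0, len(key_order), batch_size):
        batch = key_order[start:start + batch_size]
        for key, node in fetch_records(batch):
            if not fits_in_leaf(node):
                # Early exit: the remaining batches are never requested.
                return False
    return True
```

The point of the batch loop is that early `return`: since sha-based keys are randomly distributed, any batch can reveal that the remap is impossible, saving the index lookups and reads for the remaining batches.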
-------------- next part --------------
=== modified file 'bzrlib/chk_map.py'
--- a/bzrlib/chk_map.py	2008-12-24 20:23:12 +0000
+++ b/bzrlib/chk_map.py	2008-12-25 15:15:32 +0000
@@ -51,7 +51,9 @@
     )
 
 # approx 2MB
-_PAGE_CACHE_SIZE = 2*1024*1024
+# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
+# out, it takes 3.1MB to cache the layer.
+_PAGE_CACHE_SIZE = 4*1024*1024
 # We are caching bytes so len(value) is perfectly accurate
 _page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
 
@@ -1224,24 +1226,28 @@
             #       The other argument is that unmap() is uncommon, so we don't
             #       need to optimize it. But map() with a slightly shorter
             #       value may happen a lot.
-            stream = store.get_record_stream(keys, 'unordered', True)
-            nodes = []
-            # Fully consume the stream, even if we could determine that we
-            # don't need to continue. We requested the bytes, we may as well
-            # use them
-            for record in stream:
-                bytes = record.get_bytes_as('fulltext')
-                node = _deserialise(bytes, record.key)
-                self._items[keys[record.key]] = node
-                _page_cache[record.key] = bytes
-                nodes.append(node)
-            for node in nodes:
-                if isinstance(node, InternalNode):
-                    # We know we won't fit
-                    return self
-                for key, value in node._items.iteritems():
-                    if new_leaf._map_no_split(key, value):
-                        return self
+            batch_size = 10
+            key_order = list(keys)
+            for batch_start in range(0, len(key_order), batch_size):
+                batch = key_order[batch_start:batch_start + batch_size]
+                stream = store.get_record_stream(batch, 'unordered', True)
+                nodes = []
+                # Fully consume the stream, even if we could determine that we
+                # don't need to continue. We requested the bytes, we may as well
+                # use them
+                for record in stream:
+                    bytes = record.get_bytes_as('fulltext')
+                    node = _deserialise(bytes, record.key)
+                    self._items[keys[record.key]] = node
+                    _page_cache[record.key] = bytes
+                    nodes.append(node)
+                for node in nodes:
+                    if isinstance(node, InternalNode):
+                        # We know we won't fit
+                        return self
+                    for key, value in node._items.iteritems():
+                        if new_leaf._map_no_split(key, value):
+                            return self
 
         # We have gone to every child, and everything fits in a single leaf
         # node, we no longer need this internal node
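The cache-size comment in the first hunk can be checked with quick arithmetic, using the figures the comment itself assumes (255 internal pages, 255-way fan-out, ~50 bytes per line):

```python
# Figures taken from the comment in the diff above.
pages = 255          # internal pages in one layer
fan_out = 255        # children per internal page
bytes_per_line = 50  # rough size of one cached line

layer_bytes = pages * fan_out * bytes_per_line
print(round(layer_bytes / (1024 * 1024), 1))  # -> 3.1 (MB)
```

At ~3.1MB, one full layer would not fit in the old 2MB cache but does fit in the new 4MB one, which is why the bump lets a single working set stay cached.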


