Rev 42: Switch over to the new filter_keys_by_presence functionality. in http://bzr.arbash-meinel.com/plugins/index2

John Arbash Meinel john at arbash-meinel.com
Wed Jul 16 15:05:40 BST 2008


At http://bzr.arbash-meinel.com/plugins/index2

------------------------------------------------------------
revno: 42
revision-id: john at arbash-meinel.com-20080716140531-8tmbw4y6vrjlxwfx
parent: john at arbash-meinel.com-20080716132023-zrufbn0leu228waw
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: index2
timestamp: Wed 2008-07-16 09:05:31 -0500
message:
  Switch over to the new filter_keys_by_presence functionality.
  
  Along with the improvements to the pybloom code, we know have a slight edge
  on search_from_revid and miss torture.
  iter_random won't ever be faster, because it *only* requests entries that
  are known to be present. So the bloom work is always going to hit, always
  going to have to compute all hashes, and always going to be extra work for
  no benefit.
  However, if we can get generic searching and miss torture to be faster,
  then we have a real chance to be a benefit on remote repositories, etc.
  Especially if we implement paged blooms.
-------------- next part --------------
=== modified file 'btree_index.py'
--- a/btree_index.py	2008-07-16 13:20:23 +0000
+++ b/btree_index.py	2008-07-16 14:05:31 +0000
@@ -87,6 +87,24 @@
             offsets[key] = tuple([int(hash(text, seed)) for seed in seeds])
         return offsets
 
+    def filter_keys_by_presence(self, keys):
+        """Filter out which keys are present.
+
+        Same as filter_by_presence, but converts key tuples into strings.
+        :param keys: An iterable of keys to check
+        :return: ([present_keys], [missing_keys])
+        """
+        # Convert tuples into strings, but preserve their ordering
+        str_and_keys = [('\x00'.join(key), key) for key in keys]
+        str_to_keys = dict(str_and_keys)
+        present_str_keys, missing_str_keys = self.filter_by_presence(
+            [k[0] for k in str_and_keys])
+        # Do we want to have to always map back *both* the present strings and
+        # the missing ones?
+        present_keys = [str_to_keys[s] for s in present_str_keys]
+        missing_keys = [str_to_keys[s] for s in missing_str_keys]
+        return present_keys, missing_keys
+
     # @staticmethod
     # def get_raw_offsets(keys):
     #     """Compute the raw internal offsets for these keys"""
@@ -967,17 +985,9 @@
         if self._use_blooms:
             global_bloom = self._get_global_bloom()
             if global_bloom is not None:
-                remaining_keys = []
-                for key in needed_keys:
-                    if key == last_key:
-                        dupes[0] += 1
-                        continue
-                    last_key = key
-                    if '\x00'.join(key) in global_bloom:
-                        remaining_keys.append(key)
-                    else:
-                        bloom_hits[0] += 1
-                needed_keys = remaining_keys
+                needed_keys, missing_keys = global_bloom.filter_keys_by_presence(
+                    needed_keys)
+                bloom_hits[0] += len(missing_keys)
         else:
             needed_keys = keys
         # 6 seconds spent in miss_torture using the sorted() line.



More information about the bazaar-commits mailing list