Rev 49: Merge the chk sorting code. in http://bazaar.launchpad.net/%7Ebzr/bzr-groupcompress/trunk
John Arbash Meinel
john at arbash-meinel.com
Fri Feb 27 03:29:08 GMT 2009
At http://bazaar.launchpad.net/%7Ebzr/bzr-groupcompress/trunk
------------------------------------------------------------
revno: 49
revision-id: john at arbash-meinel.com-20090227032810-wughh2jot1eiq94b
parent: john at arbash-meinel.com-20090225230422-4oigw03k7fq62eyb
parent: john at arbash-meinel.com-20090227030449-r8cmknorcmlgaqtm
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: trunk
timestamp: Thu 2009-02-26 21:28:10 -0600
message:
Merge the chk sorting code.
Restore labels and sha1s in the stored data.
Leave the 'extra' formats commented out for now.
modified:
groupcompress.py groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
repofmt.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
tests/test_groupcompress.py test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
------------------------------------------------------------
revno: 45.1.6
revision-id: john at arbash-meinel.com-20090227030449-r8cmknorcmlgaqtm
parent: john at arbash-meinel.com-20090226224152-z4jiazt0gp1vsylk
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: experimental
timestamp: Thu 2009-02-26 21:04:49 -0600
message:
Clustering chk pages properly makes a big difference.
By iterating root nodes in the same order as the referencing inventory,
and then iterating by search prefix, we get compression of about 2:1 versus
not compressing at all, which is probably 50% better than random ordering.
modified:
groupcompress.py groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
repofmt.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
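A minimal sketch of the ordering described in the message above (illustration
only, not code from this branch; children_of(key) is an assumed helper that
returns (search prefix, child key) pairs for an internal chk node and nothing
for a leaf):

    def walk_chk_keys(root_keys, children_of):
        """Yield chk page keys level by level, grouping siblings by prefix."""
        cur_keys = list(root_keys)   # roots in referencing-inventory order
        queued = set(cur_keys)       # guard: never queue a page twice
        while cur_keys:
            by_prefix = {}
            for key in cur_keys:
                yield key
                for prefix, child in children_of(key):
                    if child not in queued:
                        queued.add(child)
                        by_prefix.setdefault(prefix, []).append(child)
            # Emit the next level sorted by prefix, so pages covering similar
            # key ranges stay adjacent and delta-compress against each other.
            cur_keys = []
            for prefix in sorted(by_prefix):
                cur_keys.extend(by_prefix[prefix])

Keeping siblings that share a search prefix adjacent is the property the 2:1
figure above relies on; a random ordering interleaves unrelated pages and
gives the delta compressor far less to work with.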
------------------------------------------------------------
revno: 45.1.5
revision-id: john at arbash-meinel.com-20090226224152-z4jiazt0gp1vsylk
parent: john at arbash-meinel.com-20090226220934-lnqvbe6uqle8eoum
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: experimental
timestamp: Thu 2009-02-26 16:41:52 -0600
message:
Try a different method of streaming the chk pages.
In this method, we work out which chk pages are referenced by which inventory
pages, and then fetch them based on breadth-first references.
This should mean that pages that will compress well together are
sent together, rather than in an arbitrary order.
Note that we might want to do a little better still, and use
a list to record the first time we encounter a key, rather than sets everywhere
(we would still want a set so we don't add a key to the list more than once).
Then again, 'unordered' may reorder things anyway, so it may not matter.
We should also consider using multiple chk streams, because that
would likely give better compression, by forcing breaks between the
gc groups.
modified:
repofmt.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
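A rough sketch of the first half of that idea, collecting the chk root keys
while the inventory records stream past (the record API and deserialise call
here are simplified stand-ins for the _get_filtered_inv_stream hunk in the
diff below):

    def collect_inv_chk_roots(inv_records, deserialise_inv):
        """Pass inventory records through, noting each new chk root key."""
        id_roots = []         # list: preserves first-encounter order
        id_roots_set = set()  # set: cheap "have we seen this root?" check
        def filtered(stream):
            for record in stream:
                inv = deserialise_inv(record.get_bytes_as('fulltext'),
                                      record.key)
                root_key = inv.id_to_entry.key()
                if root_key not in id_roots_set:
                    id_roots_set.add(root_key)
                    id_roots.append(root_key)
                yield record
        return filtered(inv_records), id_roots

Because filtered() is a generator, id_roots only fills in as the caller
consumes the inventory stream, which is why, in the diff below, the chk_bytes
stream is built only after the inventories have been copied.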
------------------------------------------------------------
revno: 45.1.4
revision-id: john at arbash-meinel.com-20090226220934-lnqvbe6uqle8eoum
parent: john at arbash-meinel.com-20090226215937-4n69g2lfbjm3yyip
parent: john at arbash-meinel.com-20090225230422-4oigw03k7fq62eyb
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: experimental
timestamp: Thu 2009-02-26 16:09:34 -0600
message:
Bring in the missing update from 'trunk'
modified:
groupcompress.py groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
repofmt.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
------------------------------------------------------------
revno: 45.1.3
revision-id: john at arbash-meinel.com-20090226215937-4n69g2lfbjm3yyip
parent: john at arbash-meinel.com-20090226215757-0rpcnz9vf6z3eikn
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: experimental
timestamp: Thu 2009-02-26 15:59:37 -0600
message:
Play with some experimental alternate hash formats, commenting them out for now.
modified:
repofmt.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
------------------------------------------------------------
revno: 45.1.2
revision-id: john at arbash-meinel.com-20090226215757-0rpcnz9vf6z3eikn
parent: john at arbash-meinel.com-20090225221429-l0is3qxy1hvzuhes
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: experimental
timestamp: Thu 2009-02-26 15:57:57 -0600
message:
Experiment with removing the label and sha1 fields. This seems to shrink texts by 10-30%.
modified:
groupcompress.py groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
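For a sense of scale (hypothetical numbers, not measurements from this
branch): every record currently starts with a 'label:' line and a 'sha1:'
line, as the groupcompress.py hunk below shows, and for small chk pages those
two lines are a large share of the uncompressed text:

    # Hypothetical back-of-the-envelope figures, not measured data.
    label_line = 'label: sha1:' + '0' * 40 + '\n'  # chk keys look like ('sha1:<hex>',)
    sha1_line = 'sha1: ' + '0' * 40 + '\n'
    header = len(label_line) + len(sha1_line)      # about 100 bytes per record
    for body in (300, 1000, 4000):                 # plausible page sizes
        print('%4d-byte page: %2.0f%% header overhead'
              % (body, 100.0 * header / (header + body)))

That puts the header at roughly 25% of a small page and only a few percent of
a large one, which is broadly consistent with the 10-30% shrinkage reported
above and explains why the cost matters most for chk-heavy data.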
-------------- next part --------------
=== modified file 'groupcompress.py'
--- a/groupcompress.py 2009-02-25 22:21:23 +0000
+++ b/groupcompress.py 2009-02-27 03:28:10 +0000
@@ -73,6 +73,7 @@
else:
contents = [next() for i in xrange(numbers[0])]
result.append((op, None, numbers[0], contents))
+ ## return result
return label, sha1, result
@@ -220,9 +221,11 @@
if key[-1] is None:
key = key[:-1] + ('sha1:' + sha1,)
label = '\x00'.join(key)
- new_lines = []
- new_lines.append('label: %s\n' % label)
- new_lines.append('sha1: %s\n' % sha1)
+ ## new_lines = []
+ new_lines = ['label: %s\n' % label,
+ 'sha1: %s\n' % sha1,
+ ]
+ ## index_lines = []
index_lines = [False, False]
# setup good encoding for trailing \n support.
if not lines or lines[-1].endswith('\n'):
@@ -275,6 +278,7 @@
delta_details = self.labels_deltas[key]
delta_lines = self.lines[delta_details[0][1]:delta_details[1][1]]
label, sha1, delta = parse(delta_lines)
+ ## delta = parse(delta_lines)
if label != key:
raise AssertionError("wrong key: %r, wanted %r" % (label, key))
# Perhaps we want to keep the line offsets too in memory at least?
@@ -474,7 +478,7 @@
chunks = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
parent_lines = [parent_cache[parent] for parent in parent_map[key]]
parent_cache[key] = list(
- reannotate(parent_lines, fulltext, key, None, head_cache))
+ reannotate(parent_lines, chunks, key, None, head_cache))
return parent_cache[key]
def check(self, progress_bar=None):
@@ -577,10 +581,11 @@
valid until the iterator is advanced.
"""
# keys might be a generator
- keys = set(keys)
+ orig_keys = list(keys)
+ keys = set(orig_keys)
if not keys:
return
- if not self._index.has_graph:
+ if not self._index.has_graph and ordering == 'topological':
# Cannot topological order when no graph has been stored.
ordering = 'unordered'
# Cheap: iterate
@@ -606,6 +611,8 @@
# balance that with the time it takes to extract ordering, by
# somehow grouping based on locations[key][0:3]
present_keys = sort_gc_optimal(parent_map)
+ elif ordering == 'as-requested':
+ present_keys = [key for key in orig_keys if key in locations]
else:
# We want to yield the keys in a semi-optimal (read-wise) ordering.
# Otherwise we thrash the _group_cache and destroy performance
@@ -628,10 +635,12 @@
else:
index_memo, _, parents, (method, _) = locations[key]
plain, delta_lines = self._get_group_and_delta_lines(index_memo)
+ ## delta = parse(delta_lines)
label, sha1, delta = parse(delta_lines)
if label != key:
raise AssertionError("wrong key: %r, wanted %r" % (label, key))
lines = apply_delta(plain, delta)
+ ## sha1 = sha_strings(lines)
if sha_strings(lines) != sha1:
raise AssertionError('sha1 sum did not match')
yield ChunkedContentFactory(key, parents, sha1, lines)
@@ -703,8 +712,7 @@
except errors.UnavailableRepresentation:
adapter_key = record.storage_kind, 'fulltext'
adapter = get_adapter(adapter_key)
- bytes = adapter.get_bytes(record,
- record.get_bytes_as(record.storage_kind))
+ bytes = adapter.get_bytes(record)
lines = osutils.split_lines(bytes)
soft = False
if len(record.key) > 1:
=== modified file 'repofmt.py'
--- a/repofmt.py 2009-02-25 23:04:22 +0000
+++ b/repofmt.py 2009-02-27 03:04:49 +0000
@@ -24,6 +24,7 @@
debug,
errors,
knit,
+ inventory,
pack,
repository,
ui,
@@ -58,8 +59,13 @@
CHKInventoryRepository,
RepositoryFormatPackDevelopment5,
RepositoryFormatPackDevelopment5Hash16,
+## RepositoryFormatPackDevelopment5Hash16b,
+## RepositoryFormatPackDevelopment5Hash63,
+## RepositoryFormatPackDevelopment5Hash127a,
+## RepositoryFormatPackDevelopment5Hash127b,
RepositoryFormatPackDevelopment5Hash255,
)
+ from bzrlib import chk_map
chk_support = True
except ImportError:
chk_support = False
@@ -236,6 +242,86 @@
self.repo.signatures._index._add_callback = self.signature_index.add_callback
self.repo.texts._index._add_callback = self.text_index.add_callback
+    def _get_filtered_inv_stream(self, source_vf, keys):
+        """Filter the texts of inventories, to find the chk pages."""
+        id_roots = []
+        p_id_roots = []
+        id_roots_set = set()
+        p_id_roots_set = set()
+        def _filter_inv_stream(stream):
+            for idx, record in enumerate(stream):
+                ### child_pb.update('fetch inv', idx, len(inv_keys_to_fetch))
+                bytes = record.get_bytes_as('fulltext')
+                chk_inv = inventory.CHKInventory.deserialise(None, bytes, record.key)
+                key = chk_inv.id_to_entry.key()
+                if key not in id_roots_set:
+                    id_roots.append(key)
+                    id_roots_set.add(key)
+                p_id_map = chk_inv.parent_id_basename_to_file_id
+                if p_id_map is not None:
+                    key = p_id_map.key()
+                    if key not in p_id_roots_set:
+                        p_id_roots_set.add(key)
+                        p_id_roots.append(key)
+                yield record
+        stream = source_vf.get_record_stream(keys, 'gc-optimal', True)
+        return _filter_inv_stream(stream), id_roots, p_id_roots
+
+    def _get_chk_stream(self, source_vf, keys, id_roots, p_id_roots):
+        # We want to stream the keys from 'id_roots', and things they
+        # reference, and then stream things from p_id_roots and things they
+        # reference, and then any remaining keys that we didn't get to.
+
+        # We also group referenced texts together, so if one root references a
+        # text with prefix 'a', and another root references a node with prefix
+        # 'a', we want to yield those nodes before we yield the nodes for 'b'
+        # This keeps 'similar' nodes together
+
+        # Note: We probably actually want multiple streams here, to help the
+        # client understand that the different levels won't compress well
+        # against eachother
+        remaining_keys = set(keys)
+        def _get_referenced_stream(root_keys):
+            cur_keys = root_keys
+            while cur_keys:
+                keys_by_search_prefix = {}
+                remaining_keys.difference_update(cur_keys)
+                next_keys = set()
+                stream = source_vf.get_record_stream(cur_keys, 'as-requested',
+                                                     True)
+                for record in stream:
+                    bytes = record.get_bytes_as('fulltext')
+                    # We don't care about search_key_func for this code,
+                    # because we only care about external references.
+                    node = chk_map._deserialise(bytes, record.key,
+                                                search_key_func=None)
+                    common_base = node._search_prefix
+                    if isinstance(node, chk_map.InternalNode):
+                        for prefix, value in node._items.iteritems():
+                            assert isinstance(value, tuple)
+                            if value not in next_keys:
+                                keys_by_search_prefix.setdefault(prefix,
+                                    []).append(value)
+                                next_keys.add(value)
+                    yield record
+                # Double check that we won't be emitting any keys twice
+                next_keys = next_keys.intersection(remaining_keys)
+                cur_keys = []
+                for prefix in sorted(keys_by_search_prefix):
+                    cur_keys.extend(keys_by_search_prefix[prefix])
+        for record in _get_referenced_stream(id_roots):
+            yield record
+        for record in _get_referenced_stream(p_id_roots):
+            yield record
+        if remaining_keys:
+            trace.note('There were %d keys in the chk index, which'
+                       ' were not referenced from inventories',
+                       len(remaining_keys))
+            stream = source_vf.get_record_stream(remaining_keys, 'unordered',
+                                                 True)
+            for record in stream:
+                yield record
+
def _execute_pack_operations(self, pack_operations, _packer_class=Packer,
reload_func=None):
"""Execute a series of pack operations.
@@ -267,7 +353,13 @@
('text_index', 'texts'),
('signature_index', 'signatures'),
]
+ # TODO: This is a very non-optimal ordering for chk_bytes. The
+ # issue is that pages that are similar are not transmitted
+ # together. Perhaps get_record_stream('gc-optimal') should be
+ # taught about how to group chk pages?
+ has_chk = False
if getattr(self, 'chk_index', None) is not None:
+ has_chk = True
to_copy.insert(2, ('chk_index', 'chk_bytes'))
# Shouldn't we start_write_group around this?
@@ -302,7 +394,16 @@
is_locked=self.repo.is_locked),
access=target_access,
delta=source_vf._delta)
- stream = source_vf.get_record_stream(keys, 'gc-optimal', True)
+ stream = None
+ if has_chk:
+ if vf_name == 'inventories':
+ stream, id_roots, p_id_roots = self._get_filtered_inv_stream(
+ source_vf, keys)
+ elif vf_name == 'chk_bytes':
+ stream = self._get_chk_stream(source_vf, keys,
+ id_roots, p_id_roots)
+ if stream is None:
+ stream = source_vf.get_record_stream(keys, 'gc-optimal', True)
target_vf.insert_record_stream(stream)
new_pack._check_references() # shouldn't be needed
except:
@@ -552,6 +653,66 @@
return ("Development repository format - hash16chk+groupcompress")
+## class RepositoryFormatPackGCPlainCHK16b(RepositoryFormatPackDevelopment5Hash16b):
+## """A hashed CHK+group compress pack repository."""
+##
+## repository_class = GCCHKPackRepository
+##
+## def get_format_string(self):
+## """See RepositoryFormat.get_format_string()."""
+## return ('Bazaar development format - hash16bchk+gc'
+## ' (needs bzr.dev from 1.13)\n')
+##
+## def get_format_description(self):
+## """See RepositoryFormat.get_format_description()."""
+## return ("Development repository format - hash16bchk+groupcompress")
+##
+##
+## class RepositoryFormatPackGCPlainCHK63(RepositoryFormatPackDevelopment5Hash63):
+## """A hashed CHK+group compress pack repository."""
+##
+## repository_class = GCCHKPackRepository
+##
+## def get_format_string(self):
+## """See RepositoryFormat.get_format_string()."""
+## return ('Bazaar development format - hash63+gc'
+## ' (needs bzr.dev from 1.13)\n')
+##
+## def get_format_description(self):
+## """See RepositoryFormat.get_format_description()."""
+## return ("Development repository format - hash63+groupcompress")
+##
+##
+## class RepositoryFormatPackGCPlainCHK127a(RepositoryFormatPackDevelopment5Hash127a):
+## """A hashed CHK+group compress pack repository."""
+##
+## repository_class = GCCHKPackRepository
+##
+## def get_format_string(self):
+## """See RepositoryFormat.get_format_string()."""
+## return ('Bazaar development format - hash127a+gc'
+## ' (needs bzr.dev from 1.13)\n')
+##
+## def get_format_description(self):
+## """See RepositoryFormat.get_format_description()."""
+## return ("Development repository format - hash127a+groupcompress")
+##
+##
+## class RepositoryFormatPackGCPlainCHK127b(RepositoryFormatPackDevelopment5Hash127b):
+## """A hashed CHK+group compress pack repository."""
+##
+## repository_class = GCCHKPackRepository
+##
+## def get_format_string(self):
+## """See RepositoryFormat.get_format_string()."""
+## return ('Bazaar development format - hash127b+gc'
+## ' (needs bzr.dev from 1.13)\n')
+##
+## def get_format_description(self):
+## """See RepositoryFormat.get_format_description()."""
+## return ("Development repository format - hash127b+groupcompress")
+
+
class RepositoryFormatPackGCPlainCHK255(RepositoryFormatPackDevelopment5Hash255):
"""A hashed CHK+group compress pack repository."""
@@ -574,6 +735,10 @@
if chk_support:
formats = formats + (RepositoryFormatPackGCPlainCHK,
RepositoryFormatPackGCPlainCHK16,
+ ## RepositoryFormatPackGCPlainCHK16b,
+ ## RepositoryFormatPackGCPlainCHK63,
+ ## RepositoryFormatPackGCPlainCHK127a,
+ ## RepositoryFormatPackGCPlainCHK127b,
RepositoryFormatPackGCPlainCHK255)
if isinstance(source._format, formats) or isinstance(target._format, formats):
return False
=== modified file 'tests/test_groupcompress.py'
--- a/tests/test_groupcompress.py 2009-02-25 22:11:02 +0000
+++ b/tests/test_groupcompress.py 2009-02-27 03:28:10 +0000
@@ -40,6 +40,7 @@
'factory':make_pack_factory(False, False, 1),
'graph': False,
'key_length':1,
+ 'support_partial_insertion':False,
}
)
applier = TestScenarioApplier()