Rev 52: Merge in the latest updates to the gc trunk. in http://bzr.arbash-meinel.com/plugins/groupcompress_rabin

John Arbash Meinel john at arbash-meinel.com
Fri Feb 27 16:07:47 GMT 2009


At http://bzr.arbash-meinel.com/plugins/groupcompress_rabin

------------------------------------------------------------
revno: 52
revision-id: john at arbash-meinel.com-20090227160746-1gt1m20vqk7i273c
parent: john at arbash-meinel.com-20090227160650-iv1rpvxsqejydxj7
parent: john at arbash-meinel.com-20090227051839-841q6ss4z8zm1353
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_rabin
timestamp: Fri 2009-02-27 10:07:46 -0600
message:
  Merge in the latest updates to the gc trunk.
modified:
  groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
  repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
  tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 48.1.7
    revision-id: john at arbash-meinel.com-20090227051839-841q6ss4z8zm1353
    parent: john at arbash-meinel.com-20090227051520-3bqqcchl92qup96h
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 23:18:39 -0600
    message:
      Update a Note/Todo
    modified:
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 48.1.6
    revision-id: john at arbash-meinel.com-20090227051520-3bqqcchl92qup96h
    parent: john at arbash-meinel.com-20090227050931-bt0zwmqxrrheyosq
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 23:15:20 -0600
    message:
      Try even harder, now with even *more* streams.
      The compressed size drops by another 4x.
      Turn the data for each *layer* into a different stream.
      With this change, gc255's compressed inventory data drops to 1.5MB,
      which is finally *smaller* than the source 'knit' format.
    modified:
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 48.1.5
    revision-id: john at arbash-meinel.com-20090227050931-bt0zwmqxrrheyosq
    parent: john at arbash-meinel.com-20090227035733-h1gmn3ymofoxc7zm
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 23:09:31 -0600
    message:
      As expected, splitting things up into streams of streams
      gives even better compression (down to 4.4MB for inventories).
      Probably the big win is that parent_id_basename content doesn't compress
      well at all versus id_to_entry content, and this way you don't
      get large offsets.
    modified:
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 48.1.4
    revision-id: john at arbash-meinel.com-20090227035733-h1gmn3ymofoxc7zm
    parent: john at arbash-meinel.com-20090227035442-4sk2vcl343hg9afv
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 21:57:33 -0600
    message:
      Add a general progress indicator for the other parts of the copy.
    modified:
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 48.1.3
    revision-id: john at arbash-meinel.com-20090227035442-4sk2vcl343hg9afv
    parent: john at arbash-meinel.com-20090227033445-gk1o5rr6t0u0d14i
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 21:54:42 -0600
    message:
      Add a progress indicator for chk pages.
      Fix a bug in handling signatures, which don't have a parent graph.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 48.1.2
    revision-id: john at arbash-meinel.com-20090227033445-gk1o5rr6t0u0d14i
    parent: john at arbash-meinel.com-20090227032810-wughh2jot1eiq94b
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 21:34:45 -0600
    message:
      Make it clear that the bits you get from 'apply_delta' are chunks, not lines.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 48.1.1
    revision-id: john at arbash-meinel.com-20090227032810-wughh2jot1eiq94b
    parent: john at arbash-meinel.com-20090225230422-4oigw03k7fq62eyb
    parent: john at arbash-meinel.com-20090227030449-r8cmknorcmlgaqtm
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: trunk
    timestamp: Thu 2009-02-26 21:28:10 -0600
    message:
      Merge the chk sorting code.
      Restore labels and sha1s in the stored data.
      Leave the 'extra' formats commented out for now.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 45.1.6
    revision-id: john at arbash-meinel.com-20090227030449-r8cmknorcmlgaqtm
    parent: john at arbash-meinel.com-20090226224152-z4jiazt0gp1vsylk
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: experimental
    timestamp: Thu 2009-02-26 21:04:49 -0600
    message:
      Clustering chk pages properly makes a big difference.
      
      By iterating root nodes in the same order as the referencing inventory,
      and then iterating by search prefix, we get roughly 2:1 compression versus
      not compressing at all, which is probably 50% better than random ordering.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 45.1.5
    revision-id: john at arbash-meinel.com-20090226224152-z4jiazt0gp1vsylk
    parent: john at arbash-meinel.com-20090226220934-lnqvbe6uqle8eoum
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: experimental
    timestamp: Thu 2009-02-26 16:41:52 -0600
    message:
      Try a different method of streaming the chk pages.
      In this method, we work out which chk pages are referenced by which
      inventory pages, and then fetch them breadth-first from those references.
      This should mean that pages that will compress well together are
      sent together, rather than in an arbitrary order.
      Note that we might want to do a little better and keep a list of keys
      in first-encounter order, rather than using sets everywhere.
      (We still want a set to make sure we don't add a key to the list
      multiple times.)
      
      Then again, 'unordered' may reorder it anyway, so it may not matter.
      We should also consider using multiple chk streams, because that
      will likely result in better compression by forcing breaks between
      the gc groups.
    modified:
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 45.1.4
    revision-id: john at arbash-meinel.com-20090226220934-lnqvbe6uqle8eoum
    parent: john at arbash-meinel.com-20090226215937-4n69g2lfbjm3yyip
    parent: john at arbash-meinel.com-20090225230422-4oigw03k7fq62eyb
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: experimental
    timestamp: Thu 2009-02-26 16:09:34 -0600
    message:
      Bring in the missing update from 'trunk'
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 45.1.3
    revision-id: john at arbash-meinel.com-20090226215937-4n69g2lfbjm3yyip
    parent: john at arbash-meinel.com-20090226215757-0rpcnz9vf6z3eikn
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: experimental
    timestamp: Thu 2009-02-26 15:59:37 -0600
    message:
      Play with some experimental alternate hashes; comment them out for now.
    modified:
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 45.1.2
    revision-id: john at arbash-meinel.com-20090226215757-0rpcnz9vf6z3eikn
    parent: john at arbash-meinel.com-20090225221429-l0is3qxy1hvzuhes
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: experimental
    timestamp: Thu 2009-02-26 15:57:57 -0600
    message:
      Experiment with removing the label and sha1 fields. Seems to shrink texts by 10-30%.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
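
The copy operations in these revisions build on the 'gc-optimal' ordering,
which the attached groupcompress.py diff also touches: approximately reverse
topological, grouped by prefix (file-id). For reference, here is a minimal
standalone sketch of that grouping, loosely mirroring the sort_gc_optimal hunk
in the attached diff. The function name and the sample parent_map are made up
for illustration; topo_sort is bzrlib.tsort.topo_sort, as in the real code.

from bzrlib.tsort import topo_sort

def sort_gc_optimal_sketch(parent_map):
    # Group (key, parents) items by prefix; 1-tuple or plain-string keys all
    # share the '' prefix, longer keys group by their first element (file-id).
    per_prefix_map = {}
    for key, parents in parent_map.iteritems():
        if isinstance(key, str) or len(key) == 1:
            prefix = ''
        else:
            prefix = key[0]
        per_prefix_map.setdefault(prefix, []).append((key, parents))
    # Within each prefix, emit newest-first (reverse topological), so closely
    # related texts end up adjacent in the compression groups.
    present_keys = []
    for prefix in sorted(per_prefix_map):
        present_keys.extend(reversed(topo_sort(per_prefix_map[prefix])))
    return present_keys

# Illustrative only: two file-ids, one with a two-revision history.
parent_map = {
    ('file-a', 'rev-1'): (),
    ('file-a', 'rev-2'): (('file-a', 'rev-1'),),
    ('file-b', 'rev-1'): (),
}
print sort_gc_optimal_sketch(parent_map)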
-------------- next part --------------
=== modified file 'groupcompress.py'
--- a/groupcompress.py	2009-02-25 22:21:23 +0000
+++ b/groupcompress.py	2009-02-27 03:54:42 +0000
@@ -73,6 +73,7 @@
         else:
             contents = [next() for i in xrange(numbers[0])]
             result.append((op, None, numbers[0], contents))
+    ## return result
     return label, sha1, result
 
 
@@ -109,17 +110,18 @@
     """
     # gc-optimal ordering is approximately reverse topological,
     # properly grouped by file-id.
-    per_prefix_map = {'': []}
+    per_prefix_map = {}
     present_keys = []
     for item in parent_map.iteritems():
         key = item[0]
         if isinstance(key, str) or len(key) == 1:
-            per_prefix_map[''].append(item)
+            prefix = ''
         else:
-            try:
-                per_prefix_map[key[0]].append(item)
-            except KeyError:
-                per_prefix_map[key[0]] = [item]
+            prefix = key[0]
+        try:
+            per_prefix_map[prefix].append(item)
+        except KeyError:
+            per_prefix_map[prefix] = [item]
 
     for prefix in sorted(per_prefix_map):
         present_keys.extend(reversed(topo_sort(per_prefix_map[prefix])))
@@ -220,9 +222,11 @@
         if key[-1] is None:
             key = key[:-1] + ('sha1:' + sha1,)
         label = '\x00'.join(key)
-        new_lines = []
-        new_lines.append('label: %s\n' % label)
-        new_lines.append('sha1: %s\n' % sha1)
+        ## new_lines = []
+        new_lines = ['label: %s\n' % label,
+                     'sha1: %s\n' % sha1,
+                    ]
+        ## index_lines = []
         index_lines = [False, False]
         # setup good encoding for trailing \n support.
         if not lines or lines[-1].endswith('\n'):
@@ -275,12 +279,13 @@
         delta_details = self.labels_deltas[key]
         delta_lines = self.lines[delta_details[0][1]:delta_details[1][1]]
         label, sha1, delta = parse(delta_lines)
+        ## delta = parse(delta_lines)
         if label != key:
             raise AssertionError("wrong key: %r, wanted %r" % (label, key))
         # Perhaps we want to keep the line offsets too in memory at least?
-        lines = apply_delta(''.join(self.lines), delta)
-        sha1 = sha_strings(lines)
-        return lines, sha1
+        chunks = apply_delta(''.join(self.lines), delta)
+        sha1 = sha_strings(chunks)
+        return chunks, sha1
 
     def flush_multi(self, instructions, lines, new_lines, index_lines):
         """Flush a bunch of different ranges out.
@@ -474,7 +479,7 @@
             chunks = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
             parent_lines = [parent_cache[parent] for parent in parent_map[key]]
             parent_cache[key] = list(
-                reannotate(parent_lines, fulltext, key, None, head_cache))
+                reannotate(parent_lines, chunks, key, None, head_cache))
         return parent_cache[key]
 
     def check(self, progress_bar=None):
@@ -577,10 +582,12 @@
             valid until the iterator is advanced.
         """
         # keys might be a generator
-        keys = set(keys)
+        orig_keys = list(keys)
+        keys = set(orig_keys)
         if not keys:
             return
-        if not self._index.has_graph:
+        if (not self._index.has_graph
+            and ordering in ('topological', 'gc-optimal')):
             # Cannot topological order when no graph has been stored.
             ordering = 'unordered'
         # Cheap: iterate
@@ -599,13 +606,15 @@
             # Now group by source:
         elif ordering == 'gc-optimal':
             parent_map = dict((key, details[2]) for key, details in
-                locations.iteritems())
+                              locations.iteritems())
             for key in local_keys:
                 parent_map[key] = self._unadded_refs[key]
             # XXX: This only optimizes for the target ordering. We may need to
             #      balance that with the time it takes to extract ordering, by
             #      somehow grouping based on locations[key][0:3]
             present_keys = sort_gc_optimal(parent_map)
+        elif ordering == 'as-requested':
+            present_keys = [key for key in orig_keys if key in locations]
         else:
             # We want to yield the keys in a semi-optimal (read-wise) ordering.
             # Otherwise we thrash the _group_cache and destroy performance
@@ -628,13 +637,15 @@
             else:
                 index_memo, _, parents, (method, _) = locations[key]
                 plain, delta_lines = self._get_group_and_delta_lines(index_memo)
+                ## delta = parse(delta_lines)
                 label, sha1, delta = parse(delta_lines)
                 if label != key:
                     raise AssertionError("wrong key: %r, wanted %r" % (label, key))
-                lines = apply_delta(plain, delta)
-                if sha_strings(lines) != sha1:
+                chunks = apply_delta(plain, delta)
+                ## sha1 = sha_strings(lines)
+                if sha_strings(chunks) != sha1:
                     raise AssertionError('sha1 sum did not match')
-            yield ChunkedContentFactory(key, parents, sha1, lines)
+            yield ChunkedContentFactory(key, parents, sha1, chunks)
 
     def get_sha1s(self, keys):
         """See VersionedFiles.get_sha1s()."""
@@ -703,8 +714,7 @@
             except errors.UnavailableRepresentation:
                 adapter_key = record.storage_kind, 'fulltext'
                 adapter = get_adapter(adapter_key)
-                bytes = adapter.get_bytes(record,
-                    record.get_bytes_as(record.storage_kind))
+                bytes = adapter.get_bytes(record)
                 lines = osutils.split_lines(bytes)
             soft = False
             if len(record.key) > 1:
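
The caller-visible change on the groupcompress.py side is the new
'as-requested' ordering: get_record_stream() yields the present keys in
exactly the order the caller passed them, and extracted content is now
handled as chunks rather than lines. A small, hypothetical usage sketch
(fetch_in_caller_order and the example keys are invented; vf stands for a
GroupCompressVersionedFiles instance):

def fetch_in_caller_order(vf, keys):
    # Ask for records in exactly the given key order; passing True asks for
    # a delta closure so each record can be expanded to a full text.
    texts = {}
    stream = vf.get_record_stream(keys, 'as-requested', True)
    for record in stream:
        if record.storage_kind == 'absent':
            continue  # defensively skip anything that isn't present
        # Records now carry chunks, not lines; join them for a fulltext.
        chunks = record.get_bytes_as('chunked')
        texts[record.key] = ''.join(chunks)
    return texts

# e.g. texts = fetch_in_caller_order(vf, [('inv', 'rev-2'), ('inv', 'rev-1')])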

=== modified file 'repofmt.py'
--- a/repofmt.py	2009-02-26 04:17:19 +0000
+++ b/repofmt.py	2009-02-27 16:07:46 +0000
@@ -24,6 +24,7 @@
     debug,
     errors,
     knit,
+    inventory,
     pack,
     repository,
     ui,
@@ -58,8 +59,13 @@
     CHKInventoryRepository,
     RepositoryFormatPackDevelopment5,
     RepositoryFormatPackDevelopment5Hash16,
+##    RepositoryFormatPackDevelopment5Hash16b,
+##    RepositoryFormatPackDevelopment5Hash63,
+##    RepositoryFormatPackDevelopment5Hash127a,
+##    RepositoryFormatPackDevelopment5Hash127b,
     RepositoryFormatPackDevelopment5Hash255,
     )
+    from bzrlib import chk_map
     chk_support = True
 except ImportError:
     chk_support = False
@@ -236,6 +242,94 @@
         self.repo.signatures._index._add_callback = self.signature_index.add_callback
         self.repo.texts._index._add_callback = self.text_index.add_callback
 
+    def _get_filtered_inv_stream(self, source_vf, keys):
+        """Filter the texts of inventories, to find the chk pages."""
+        id_roots = []
+        p_id_roots = []
+        id_roots_set = set()
+        p_id_roots_set = set()
+        def _filter_inv_stream(stream):
+            for idx, record in enumerate(stream):
+                ### child_pb.update('fetch inv', idx, len(inv_keys_to_fetch))
+                bytes = record.get_bytes_as('fulltext')
+                chk_inv = inventory.CHKInventory.deserialise(None, bytes, record.key)
+                key = chk_inv.id_to_entry.key()
+                if key not in id_roots_set:
+                    id_roots.append(key)
+                    id_roots_set.add(key)
+                p_id_map = chk_inv.parent_id_basename_to_file_id
+                if p_id_map is not None:
+                    key = p_id_map.key()
+                    if key not in p_id_roots_set:
+                        p_id_roots_set.add(key)
+                        p_id_roots.append(key)
+                yield record
+        stream = source_vf.get_record_stream(keys, 'gc-optimal', True)
+        return _filter_inv_stream(stream), id_roots, p_id_roots
+
+    def _get_chk_stream(self, source_vf, keys, id_roots, p_id_roots, pb=None):
+        # We want to stream the keys from 'id_roots', and things they
+        # reference, and then stream things from p_id_roots and things they
+        # reference, and then any remaining keys that we didn't get to.
+
+        # We also group referenced texts together, so if one root references a
+        # text with prefix 'a', and another root references a node with prefix
+        # 'a', we want to yield those nodes before we yield the nodes for 'b'
+        # This keeps 'similar' nodes together
+
+        # Note: We probably actually want multiple streams here, to help the
+        #       client understand that the different levels won't compress well
+        #       against each other
+        #       Test the difference between using one Group per level, and
+        #       using 1 Group per prefix. (so '' (root) would get a group, then
+        #       all the references to search-key 'a' would get a group, etc.)
+        remaining_keys = set(keys)
+        counter = [0]
+        def _get_referenced_stream(root_keys):
+            cur_keys = root_keys
+            while cur_keys:
+                keys_by_search_prefix = {}
+                remaining_keys.difference_update(cur_keys)
+                next_keys = set()
+                stream = source_vf.get_record_stream(cur_keys, 'as-requested',
+                                                     True)
+                def next_stream():
+                    for record in stream:
+                        bytes = record.get_bytes_as('fulltext')
+                        # We don't care about search_key_func for this code,
+                        # because we only care about external references.
+                        node = chk_map._deserialise(bytes, record.key,
+                                                    search_key_func=None)
+                        common_base = node._search_prefix
+                        if isinstance(node, chk_map.InternalNode):
+                            for prefix, value in node._items.iteritems():
+                                assert isinstance(value, tuple)
+                                if value not in next_keys:
+                                    keys_by_search_prefix.setdefault(prefix,
+                                        []).append(value)
+                                    next_keys.add(value)
+                        counter[0] += 1
+                        if pb is not None:
+                            pb.update('chk node', counter[0])
+                        yield record
+                yield next_stream()
+                # Double check that we won't be emitting any keys twice
+                next_keys = next_keys.intersection(remaining_keys)
+                cur_keys = []
+                for prefix in sorted(keys_by_search_prefix):
+                    cur_keys.extend(keys_by_search_prefix[prefix])
+        for stream in _get_referenced_stream(id_roots):
+            yield stream
+        for stream in _get_referenced_stream(p_id_roots):
+            yield stream
+        if remaining_keys:
+            trace.note('There were %d keys in the chk index, which'
+                       ' were not referenced from inventories',
+                       len(remaining_keys))
+            stream = source_vf.get_record_stream(remaining_keys, 'unordered',
+                                                 True)
+            yield stream
+
     def _execute_pack_operations(self, pack_operations, _packer_class=Packer,
                                  reload_func=None):
         """Execute a series of pack operations.
@@ -267,7 +361,13 @@
                        ('text_index', 'texts'),
                        ('signature_index', 'signatures'),
                       ]
+            # TODO: This is a very non-optimal ordering for chk_bytes. The
+            #       issue is that pages that are similar are not transmitted
+            #       together. Perhaps get_record_stream('gc-optimal') should be
+            #       taught about how to group chk pages?
+            has_chk = False
             if getattr(self, 'chk_index', None) is not None:
+                has_chk = True
                 to_copy.insert(2, ('chk_index', 'chk_bytes'))
 
             # Shouldn't we start_write_group around this?
@@ -302,8 +402,30 @@
                                       is_locked=self.repo.is_locked),
                         access=target_access,
                         delta=source_vf._delta)
-                    stream = source_vf.get_record_stream(keys, 'gc-optimal', True)
-                    target_vf.insert_record_stream(stream)
+                    stream = None
+                    child_pb = ui.ui_factory.nested_progress_bar()
+                    try:
+                        if has_chk:
+                            if vf_name == 'inventories':
+                                stream, id_roots, p_id_roots = self._get_filtered_inv_stream(
+                                    source_vf, keys)
+                            elif vf_name == 'chk_bytes':
+                                for stream in self._get_chk_stream(source_vf, keys,
+                                                    id_roots, p_id_roots,
+                                                    pb=child_pb):
+                                    target_vf.insert_record_stream(stream)
+                                # No more to copy
+                                stream = []
+                        if stream is None:
+                            def pb_stream():
+                                substream = source_vf.get_record_stream(keys, 'gc-optimal', True)
+                                for idx, record in enumerate(substream):
+                                    child_pb.update(vf_name, idx, len(keys))
+                                    yield record
+                            stream = pb_stream()
+                        target_vf.insert_record_stream(stream)
+                    finally:
+                        child_pb.finished()
                 new_pack._check_references() # shouldn't be needed
             except:
                 pb.finished()
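
The core of the repofmt.py change is _get_chk_stream(): starting from the
id_to_entry and parent_id_basename root keys gathered while streaming the
inventories, it walks the chk pages level by level, grouping each next level
by search prefix so that similar pages land in the same stream. A simplified,
non-bzrlib sketch of that traversal pattern (walk_by_level and get_children
are invented stand-ins; get_children plays the role of deserialising a page
and reading its (prefix, child-key) references):

def walk_by_level(root_keys, all_keys, get_children):
    remaining = set(all_keys)
    cur_keys = list(root_keys)
    while cur_keys:
        remaining.difference_update(cur_keys)
        yield cur_keys                      # one batch (stream) per level
        by_prefix = {}
        seen = set()
        for key in cur_keys:
            for prefix, child in get_children(key):
                if child in seen or child not in remaining:
                    continue                # emit each key at most once
                seen.add(child)
                by_prefix.setdefault(prefix, []).append(child)
        # Sorting by search prefix keeps similar pages adjacent, which is
        # what lets the groupcompress groups shrink.
        cur_keys = []
        for prefix in sorted(by_prefix):
            cur_keys.extend(by_prefix[prefix])
    if remaining:
        yield sorted(remaining)             # pages never referenced: last batch

# Usage (hypothetical): one batch per level for the id_to_entry roots, then
# the same again for the parent_id_basename roots, then whatever is left.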

=== modified file 'tests/test_groupcompress.py'
--- a/tests/test_groupcompress.py	2009-02-25 22:11:02 +0000
+++ b/tests/test_groupcompress.py	2009-02-27 03:28:10 +0000
@@ -40,6 +40,7 @@
             'factory':make_pack_factory(False, False, 1),
             'graph': False,
             'key_length':1,
+            'support_partial_insertion':False,
             }
         )
     applier = TestScenarioApplier()


