Rev 3654: Some more data points for the time/repack tradeoff. in http://bzr.arbash-meinel.com/branches/bzr/1.7-dev/btree

John Arbash Meinel john at arbash-meinel.com
Thu Aug 21 00:12:01 BST 2008


At http://bzr.arbash-meinel.com/branches/bzr/1.7-dev/btree

------------------------------------------------------------
revno: 3654
revision-id: john at arbash-meinel.com-20080820231159-lp0gxglwyxveiot7
parent: john at arbash-meinel.com-20080820224419-j0mtnzs2myy1n1y1
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: btree
timestamp: Wed 2008-08-20 18:11:59 -0500
message:
  Some more data points for the time/repack tradeoff.
  Instead of always using copy(), we can use a multi-step process
  (a standalone sketch follows this message):
  1) Always add bytes without flushing until we have received
     _default_min_compression_size bytes of input.
  2) Add bytes using Z_SYNC_FLUSH until we hit the page limit,
     then repack up to N times.
  3) After repacking N times, switch to using copy() before adding
     each new node with Z_SYNC_FLUSH.
  4) If we re-use the copy() compressor we will get maximum packing,
     or we can just keep using the SYNC compressor until it is full
     and then return the copy.
  The current stats were gathered with artificial data. We should
  produce a set using real-world data so we can pick the best
  time/space tradeoff. Further, we can use this information to
  find the best space values if someone issues a 'bzr pack'
  command, since that is generally a 'spend more time to get
  better results' command.
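The multi-step flow above can be sketched with nothing but the standard
zlib module. The following is a rough, illustrative sketch of steps 1-3,
not the bzrlib chunk_writer.ChunkWriter code: SketchWriter, PAGE_SIZE,
MIN_COMPRESSION_SIZE and MAX_REPACK are made-up stand-ins for the real
class and its _default_min_compression_size / _max_repack attributes,
and step 4 (re-using the copied compressor for maximum packing) is left
out. The "+ 10" slack mirrors the margin used in the real write() path.

    import zlib

    PAGE_SIZE = 4096             # assumed btree page capacity
    MIN_COMPRESSION_SIZE = 1800  # stand-in for _default_min_compression_size
    MAX_REPACK = 2               # stand-in for _max_repack


    class SketchWriter(object):
        """Accumulate node bytes into a single compressed page."""

        def __init__(self):
            self.compressor = zlib.compressobj()
            self.bytes_in = []    # raw node bytes accepted so far
            self.bytes_out = []   # compressed fragments emitted so far
            self.seen_bytes = 0
            self.num_repack = 0

        def _out_len(self):
            return sum(len(b) for b in self.bytes_out)

        def _repack(self, nodes):
            # Recompress everything from scratch with one trailing flush,
            # reclaiming the bytes the per-node Z_SYNC_FLUSH calls cost us.
            self.compressor = zlib.compressobj()
            self.bytes_out = [self.compressor.compress(b) for b in nodes]
            self.bytes_out.append(self.compressor.flush(zlib.Z_SYNC_FLUSH))

        def write(self, data):
            """Try to add one node's bytes; return True if the page is full."""
            if self.seen_bytes + len(data) < MIN_COMPRESSION_SIZE:
                # Step 1: below the threshold, never flush -- zlib buffers
                # for us and the page cannot be anywhere near full yet.
                self.bytes_out.append(self.compressor.compress(data))
            elif self.num_repack >= MAX_REPACK:
                # Step 3: no repack attempts left.  Checkpoint the compressor
                # with copy(), try the node, roll back if it does not fit.
                checkpoint = self.compressor.copy()
                out = self.compressor.compress(data)
                out += self.compressor.flush(zlib.Z_SYNC_FLUSH)
                if self._out_len() + len(out) + 10 > PAGE_SIZE:
                    self.compressor = checkpoint
                    return True
                self.bytes_out.append(out)
            else:
                # Step 2: add with Z_SYNC_FLUSH so the output length is exact.
                out = self.compressor.compress(data)
                out += self.compressor.flush(zlib.Z_SYNC_FLUSH)
                self.bytes_out.append(out)
                if self._out_len() + 10 > PAGE_SIZE:
                    # Busted the page: spend one repack attempt recompressing
                    # without the interim flushes, then re-check.
                    self.num_repack += 1
                    self._repack(self.bytes_in + [data])
                    if self._out_len() + 10 > PAGE_SIZE:
                        self._repack(self.bytes_in)  # rebuild w/o this node
                        return True
            self.bytes_in.append(data)
            self.seen_bytes += len(data)
            return False

In this sketch, a caller would keep feeding nodes until write() returns
True, then join bytes_out, finish the compressor, and start a new writer
for the next page.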
modified:
  bzrlib/chunk_writer.py         chunk_writer.py-20080630234519-6ggn4id17nipovny-1
  bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
=== modified file 'bzrlib/chunk_writer.py'
--- a/bzrlib/chunk_writer.py	2008-08-20 22:44:19 +0000
+++ b/bzrlib/chunk_writer.py	2008-08-20 23:11:59 +0000
@@ -38,13 +38,14 @@
         number of times we will try.
         In testing, some values for 100k nodes::
 
-            _max_repack     time        final node count
-             1               8.0s       704
-             2               9.2s       491
-             3              10.6s       430
-             4              12.5s       406
-             5              13.9s       395
-            20              17.7s       390
+                            w/o copy            w/ copy             w/ copy & save
+            _max_repack     time    node count  time    node count  time    node count
+             1               8.0s   704          8.8s   494         14.2s   390 #
+             2               9.2s   491          9.6s   432 #       12.9s   390
+             3              10.6s   430 #       10.8s   408         12.0s   390
+             4              12.5s   406                             12.8s   390
+             5              13.9s   395
+            20              17.7s   390         17.8s   390
     :cvar _default_min_compression_size: The expected minimum compression.
         While packing nodes into the page, we won't Z_SYNC_FLUSH until we have
         received this much input data. This saves time, because we don't bloat
@@ -169,8 +170,36 @@
             self.seen_bytes = next_seen_size
         else:
             if not reserved and self.num_repack >= self._max_repack:
-                # We have packed too many times already.
+            # if (not reserved
+            #     and (self.num_repack > self._max_repack
+            #          or (self.num_repack == self._max_repack
+            #              and not self.compressor_has_copy))):
+            #     # We have packed too many times already.
                 return True
+            if not reserved and self.num_repack == self._max_repack:
+                assert self.compressor_has_copy
+                # We are trying to sneak in a few more keys before we run out
+                # of room, so copy the compressor. If we bust, we stop right
+                # now
+                copy = self.compressor.copy()
+                out = self.compressor.compress(bytes)
+                out += self.compressor.flush(Z_SYNC_FLUSH)
+                total_len = sum(map(len, self.bytes_list)) + len(out)
+                if total_len + 10 > capacity:
+                    self.compressor = copy
+                    # Don't try any more
+                    self.num_repack += 1
+                    return True
+                # It is tempting to use the copied compressor here, because it
+                # is more tightly packed. It gets us to the maximum packing
+                # value. However, it adds about the same overhead as setting
+                # _max_repack to a higher value
+                # self.compressor = copy
+                # out = self.compressor.compress(bytes)
+                self.bytes_in.append(bytes)
+                if out:
+                    self.bytes_list.append(out)
+                return False
             # This may or may not fit, try to add it with Z_SYNC_FLUSH
             out = self.compressor.compress(bytes)
             if out:

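For reference, the copy()-as-checkpoint behaviour that the new branch above
relies on can be demonstrated with plain zlib. This is an illustrative,
self-contained snippet, not bzrlib code; the row data, the budget, and the
random payload are all made up for the demo.

    import os
    import zlib

    # copy() snapshots the compressor state, so a speculative
    # compress() + flush(Z_SYNC_FLUSH) can be abandoned without
    # corrupting the bytes already emitted.
    compressor = zlib.compressobj()
    emitted = [compressor.compress(('row-%d\n' % i).encode())
               for i in range(100)]
    emitted.append(compressor.flush(zlib.Z_SYNC_FLUSH))

    budget = sum(map(len, emitted)) + 10  # pretend the page is nearly full
    checkpoint = compressor.copy()        # snapshot before the speculative add

    speculative = compressor.compress(os.urandom(4096))  # will not fit
    speculative += compressor.flush(zlib.Z_SYNC_FLUSH)

    if sum(map(len, emitted)) + len(speculative) > budget:
        compressor = checkpoint           # roll back; nothing emitted is lost
    else:
        emitted.append(speculative)

    emitted.append(compressor.flush())    # terminate the stream we kept
    assert zlib.decompress(b''.join(emitted)).startswith(b'row-0\nrow-1\n')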
=== modified file 'bzrlib/tests/test_btree_index.py'
--- a/bzrlib/tests/test_btree_index.py	2008-08-20 20:32:47 +0000
+++ b/bzrlib/tests/test_btree_index.py	2008-08-20 23:11:59 +0000
@@ -293,6 +293,7 @@
         index = btree_index.BTreeGraphIndex(transport, 'index', size)
         # Seed the metadata, we're using internal calls now.
         index.key_count()
+        print '\n',index._row_lengths
         self.assertEqual(3, len(index._row_lengths),
             "Not enough rows: %r" % index._row_lengths)
         self.assertEqual(4, len(index._row_offsets))


