Rev 4792: Some notes about memory consumption if we wrote it all in custom objects. in http://bazaar.launchpad.net/~jameinel/bzr/chk-index

Wed Oct 28 16:53:55 GMT 2009

At http://bazaar.launchpad.net/~jameinel/bzr/chk-index

------------------------------------------------------------
revno: 4792
revision-id: john at arbash-meinel.com-20091028165347-px6xmyjssmblo1om
parent: john at arbash-meinel.com-20091028162442-w9nyvpz5f7kbremd
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: chk-index
timestamp: Wed 2009-10-28 11:53:47 -0500
message:
  Some notes about memory consumption if we wrote it all in custom objects.
-------------- next part --------------
=== modified file 'bzrlib/chk_index.py'

--- a/bzrlib/chk_index.py	2009-10-28 16:24:42 +0000
+++ b/bzrlib/chk_index.py	2009-10-28 16:53:47 +0000
@@ -368,6 +368,16 @@
             raise ValueError('we only support entries with no references')
         if key_elements != 1:
             raise ValueError('we only support single-valued keys.')
+        # Note: We could create a significantly more optimized mapping here.
+        #       Something like a Judy Tree/trie/RB tree, etc. It would save a
+        #       fair amount of memory, and allow us to walk in sorted order.
+        #       Also, a given 'node' record just needs 3 ints, for 'group num',
+        #       in-group start, in-group length. Groups are never longer than
+        #       2/4GB and we'll never have >2B groups. This drops memory to
+        #       12bytes per entry + mapping overhead. Down from ~48. The dict
+        #       itself is 12MB for the 400+k keys from launchpad + 44*400k for
+        #       all the sha strings. A Judy Tree would make the mapping
+        #       overhead <20*400k.?Or 8MB down from ~30MB.
         self._nodes = {}
         # map from group (start, end) to the final offset for the group
         # the (0, 0) group is special, it indicates the empty content group
@@ -488,7 +498,9 @@
         for idx, bit_key in enumerate(sorted(self._nodes)):
             # TODO: We could probably just use bit_key rather than passing it
             #       into 'entry_coder'. This would save us from copying those
-            #       20 bytes into the new string.
+            #       20 bytes into the new string. Though at a cost of a 4 bytes
+            #       pointer in our output buffer. So not necessarily an instant
+            #       win
             bytes = self._entry_to_bytes(bit_key, header.entry_coder)
             entry_bytes.append(bytes)
             if have_mini_index: