High memory consumption during push

Andrew Bennetts andrew at canonical.com
Tue Jul 1 02:17:27 BST 2008


I spent a fair bit of time yesterday investigating why bzr push consumes so much
memory.  The results are a bit disheartening.

Executive summary
-----------------

  * Pushing a branch takes almost as much memory as the packs of that branch!
  * The memory can be reduced by roughly 10% by interning graph keys.
  * Another 10% can be reclaimed by clearing graph index caches in Packer
    after we're finished reading from them.
  * Nearly half the memory appears to be lost to memory fragmentation!

Investigation in detail
-----------------------

I started looking at pushing over bzr+ssh://, but soon realised that it was
nothing to do with the network code.  Pushing to /tmp/blah also showed higher
than expected memory consumption, nearly as high as the size of the repository
on disk!  The memory grows gradually during the operation, most of which is
spent inside one part or another of Packer._create_pack_from_pack.

Some investigation with pdb and my crude "countrefs.py" hack
(http://twistedmatrix.com/users/spiv/countrefs.py) showed that the most
allocated type of object was tuples, over 600000 of them.  125144 of them had
purely string elements, but of those only about 18000 were obviously graph keys.
Among those tuples there were 1143 distinct copies of
'NEWS-20050323055033-4e00b5db738777ff'.  That's about 40k of memory used to hold
a single 36 byte string.
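
countrefs.py is just a quick hack; to reproduce this kind of measurement with
nothing but the stdlib, something along these lines would do (my own sketch
here, not what countrefs.py actually does):

---
# Rough stand-in for countrefs.py: look at the tuples the cyclic GC knows
# about, pick out the all-string ones (graph keys look like that), and count
# how many distinct string objects share each value.  Heavily duplicated
# values are the obvious interning candidates.
import gc
from collections import defaultdict

tuples = [o for o in gc.get_objects() if type(o) is tuple]
string_tuples = [t for t in tuples
                 if t and all(type(e) is str for e in t)]
copies = defaultdict(set)
for t in string_tuples:
    for elem in t:
        copies[elem].add(id(elem))
print '%d tuples, %d with purely string elements' % (
    len(tuples), len(string_tuples))
for value, ids in sorted(copies.iteritems(),
                         key=lambda item: len(item[1]), reverse=True)[:10]:
    print '%6d copies of %r' % (len(ids), value)
---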

I was able to trim about 10% of the memory fairly cheaply by interning graph
key elements in GraphIndex.  The patch is trivial:

=== modified file 'bzrlib/index.py'
--- bzrlib/index.py	2008-06-17 09:32:08 +0000
+++ bzrlib/index.py	2008-07-01 00:19:04 +0000
@@ -878,7 +878,7 @@
             if len(elements) != self._expected_elements:
                 raise errors.BadIndexData(self)
             # keys are tuples
-            key = tuple(elements[:self._key_length])
+            key = tuple([intern(elem) for elem in elements[:self._key_length]])
             if first_key is None:
                 first_key = key
             absent, references, value = elements[-3:]



There's possibly a small speed hit from doing that, or possibly a small speed
gain.
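
For anyone wondering why interning helps: intern() returns the canonical copy
of a string, so equal key elements parsed from different index rows collapse
into a single shared object.  A toy illustration (not bzrlib code):

---
# Two equal strings built at runtime are distinct objects; intern() maps
# both to one shared copy, so N duplicates cost one string plus N pointers
# instead of N whole strings.
suffix = '20050323055033-4e00b5db738777ff'
a = 'NEWS-' + suffix
b = 'NEWS-' + suffix
print a == b, a is b                  # True False: equal, but two objects
print intern(a) is intern(b)          # True: one shared object
---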

90% of too much is still too much, though.

At this point I started using Heapy (<http://guppy-pe.sourceforge.net/#Heapy>)
to analyse memory, in the hope of getting more details.  Heapy is pretty
powerful, but has terrible documentation.  I think it may also not be very
portable, but I have a vanilla x86 laptop.  I'll write up a short guide to using
it so people don't have to rely just on Heapy's own docs.  One great thing about
Heapy is that it can report on more types than you'll find with
gc.get_objects(), e.g. non-container types like str, which are rather important!
It also knows the actual memory footprint of Python objects, so it can tell you
exactly how many bytes an object or set of objects requires.
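
Until that guide exists, the minimal incantation looks roughly like this (run
inside the process you want to measure, e.g. from a pdb prompt):

---
# Minimal guppy/heapy snapshot; see its own docs for the full query language.
from guppy import hpy
h = hpy()
heap = h.heap()     # everything heapy can see right now
print heap          # per-type breakdown: count, total size, cumulative %
print heap.size     # total bytes accounted for
print heap[0]       # drill into the largest row (tuples, in this case)
---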

After breaking into the push with SIGQUIT, guppy reported:

  * 626393 tuples, consuming 19.9M
    * that's an average of 32 bytes/tuple (a tuple is 24 bytes of header plus
      4 bytes per element pointer here, so on average these tuples have 2
      elements)
  * 220459 str, consuming 14.8M
    * an average of 67 bytes/str
  * 557 dict, consuming 3.6M (-> 6325 bytes/dict!)
    * excludes dict of module, dict of type
  * no other types of any significant size.

But approx 50% of the RES figure I saw in top is apparently not visible to
guppy.  More on this later...

My guess was that most of those tuples were graph keys, and cached graph
bisection information, etc.  Packer._create_pack_from_pack copies revision texts, then
inventory texts, then file texts, then finally signature texts, and at each step
there's another set of *.?ix files being parsed.  After each step, though, those
files aren't needed again (at least for "bzr push").  Reading in all the indices
does consume a lot of memory, for instance try this script:

---
import sys
from bzrlib.bzrdir import BzrDir
bd = BzrDir.open(sys.argv[1])
repo = bd.open_branch().repository
list(repo.inventories._index._graph_index.iter_all_entries())
list(repo.texts._index._graph_index.iter_all_entries())
list(repo.revisions._index._graph_index.iter_all_entries())
list(repo.signatures._index._graph_index.iter_all_entries())
raw_input('Now run top to see memory use.  Hit enter to exit.')
---
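
(Run it with the path to a branch as its only argument; anything with a
decent-sized repository, like bzr.dev or the Launchpad tree, shows the effect.)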

(Yes Robert I know iter_all_entries is evil :)

The memory that script consumes is, coincidentally, roughly the size of the data
in the .bzr/repository/packs directory.  I saw this both with bzr.dev and with
Launchpad's source (which is several times larger).

So I wrote a hack to simply reset the GraphIndex objects to empty once they
were no longer needed:

=== modified file 'bzrlib/index.py'
--- bzrlib/index.py	2008-06-17 09:32:08 +0000
+++ bzrlib/index.py	2008-07-01 00:41:20 +0000
@@ -284,6 +284,15 @@
     def __ne__(self, other):
         return not self.__eq__(other)
 
+    def empty_cache(self):
+        self._bisect_nodes = None
+        self._nodes = None
+        self._parsed_byte_map = []
+        self._parsed_key_map = []
+        self._key_count = None
+        self._keys_by_offset = None
+        self._nodes_by_key = None
+
     def _buffer_all(self):
         """Buffer all the index data.
 
@@ -999,6 +1008,10 @@
         """
         self._indices = indices
 
+    def empty_cache(self):
+        for index in self._indices:
+            index.empty_cache()
+
     def __repr__(self):
         return "%s(%s)" % (
                 self.__class__.__name__,

=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py	2008-06-17 05:02:34 +0000
+++ bzrlib/repofmt/pack_repo.py	2008-06-30 14:15:46 +0000
@@ -618,6 +618,9 @@
         total_items, readv_group_iter = self._revision_node_readv(revision_nodes)
         list(self._copy_nodes_graph(revision_index_map, self.new_pack._writer,
             self.new_pack.revision_index, readv_group_iter, total_items))
+        # Release some memory
+        for index in revision_index_map.iterkeys():
+            index.empty_cache()
         if 'pack' in debug.debug_flags:
             mutter('%s: create_pack: revisions copied: %s%s %d items t+%6.3fs',
                 time.ctime(), self._pack_collection._upload_transport.base,
@@ -651,6 +654,9 @@
         inv_lines = self._copy_nodes_graph(inventory_index_map,
             self.new_pack._writer, self.new_pack.inventory_index,
             readv_group_iter, total_items, output_lines=output_lines)
+        # Release some memory
+        for index in inventory_index_map.iterkeys():
+            index.empty_cache()
         if self.revision_ids:
             self._process_inventory_lines(inv_lines)
         else:
@@ -688,6 +694,9 @@
         total_items, readv_group_iter = self._least_readv_node_readv(text_nodes)
         list(self._copy_nodes_graph(text_index_map, self.new_pack._writer,
             self.new_pack.text_index, readv_group_iter, total_items))
+        # Release some memory
+        for index in text_index_map.iterkeys():
+            index.empty_cache()
         self._log_copied_texts()
 
     def _check_references(self):


But this only reclaimed another 10% or so, and I doubt this is always a good
tradeoff.

Taking a random guess and calling bzrlib.cache_utf8.clear_encoding_cache()
didn't make any difference.

And lots of the memory was still not accounted for by heapy.  At this point I
was looking at 62MB in top's RES column, but heapy was reporting only 18MB.  I
deleted our C modules from my bzr tree in case they were somehow holding memory
that heapy couldn't see, but that made no difference.

At this point I added a pdb.set_trace() at the end of _create_pack_from_pack
(mostly I had been randomly interrupting with SIGQUIT).

I finally guessed that it might be memory fragmentation, and realised I could
perhaps test for this quite easily:

(Pdb) cruds = []
(Pdb) xxx = [cruds.append(['a b ' * 10 for x in range(30000)]) for y in range(10)]

No memory increase.  Do it again.

(Pdb) xxx = [cruds.append(['a b ' * 10 for x in range(30000)]) for y in range(10)]

15MB increase!  Do it twice more:

(Pdb) xxx = [cruds.append(['a b ' * 10 for x in range(30000)]) for y in range(10)]
(Pdb) xxx = [cruds.append(['a b ' * 10 for x in range(30000)]) for y in range(10)]

40MB increase, and each subsequent pair of these adds another 40MB.  But the
first pair only added 15MB, which means those earlier allocations were largely
satisfied from at least 25MB of memory the process already held but wasn't
using, presumably because of fragmentation.
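
Rather than flipping back to top between commands, a tiny Linux-only helper
pasted into the pdb prompt can report RSS from inside the process (this helper
is mine, not something in bzrlib):

---
# Read this process's resident set size from /proc/self/statm (Linux only),
# so the experiment above can print before/after numbers itself.
import resource

def rss_bytes():
    f = open('/proc/self/statm')
    try:
        # The second field is the resident set size, in pages.
        return int(f.read().split()[1]) * resource.getpagesize()
    finally:
        f.close()

before = rss_bytes()
cruds = [['a b ' * 10 for x in range(30000)] for y in range(10)]
print 'grew by %.1f MB' % ((rss_bytes() - before) / 1e6)
---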

The good news, then, is that a long-lived bzrlib process (like the smart
server) might have high memory consumption, but it should at least stabilise
rather than grow indefinitely... :/

What to do?
-----------

First, we need to be aware that fragmentation is a serious confounding factor in
measuring memory use.  The number and size of allocated objects is not
necessarily related to the memory consumption seen by the OS.  Presumably the
allocation patterns of our code and internal details of CPython (which change
from release to release) both have an impact on that figure.  So naïve
benchmarking of memory high-water marks might point us in confusing directions.
Whenever we write tools that report the memory consumption seen by the OS, we
should probably also report the corresponding value from heapy, e.g.
"from guppy import hpy; print hpy().heap().size".
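
Something like this, say (a hypothetical helper; the OS-side figure comes from
/proc, so it's Linux-only as written):

---
# Report both the OS's view (RSS) and heapy's view of memory, so that
# fragmentation shows up as the gap between the two numbers.
import resource
from guppy import hpy

def report_memory():
    f = open('/proc/self/statm')
    try:
        rss = int(f.read().split()[1]) * resource.getpagesize()
    finally:
        f.close()
    heap_size = hpy().heap().size
    print 'RSS %.1f MB, heapy-visible %.1f MB, unaccounted %.1f MB' % (
        rss / 1e6, heap_size / 1e6, (rss - heap_size) / 1e6)
---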

We should also think about what we can do to avoid memory allocation patterns
that leave the heap so sparsely fragmented.  Pure Python unfortunately doesn't
give us any way to allocate an object in a different memory arena.  Perhaps this
is a reason
to move more code into C, such as GraphIndex?  Another possibility *might* be to
mmap pack files rather than reading them into strings.
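
Just to sketch the mmap idea (the file name is made up, and whether our readv
patterns would actually play nicely with this is an open question):

---
# Map a pack file read-only instead of read()ing it into heap strings.
# Slices still create strings, but only for the ranges actually touched,
# and the mapped pages live in the page cache rather than on Python's heap.
import mmap

f = open('.bzr/repository/packs/0123abcd.pack', 'rb')   # name made up
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
chunk = m[100:200]    # pull out just the bytes we need, on demand
m.close()
f.close()
---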

I think we probably also want to merge my string interning one-liner :)

Other than that, I'm not sure what we can do.  Perhaps we need to be trying
harder to keep the high-water mark down in the first place?  Ideas are welcome.

-Andrew.



