Bazaar: Out of memory

James Henstridge james at jamesh.id.au
Fri May 9 16:19:02 BST 2008


2008/5/9 Ian Clatworthy <ian.clatworthy at internode.on.net>:
> James Henstridge wrote:
>
>  > The .join() method is faster because it does fewer memory allocations
>  > and copies than the for loop you list above.  It should allocate a
>  > single string for the final concatenated string.  In contrast, the for
>  > loop version does a string allocation and copy for every iteration (so
>  > in the final iteration you'd expect it to have allocated roughly twice
>  > the memory).
>
>  Hmmm ...
>
>  Launchpad shows the exact change I made in bzr-fastexport here:
>  http://bazaar.launchpad.net/~bzr/bzr-fastimport/fastimport.dev/revision/72.
>  The impact was *dramatic*: the memory shown by Gnome System Manager
>  importing a repository (http://repo.or.cz/w/AutomatorExifMover.git) with
>  a large binary file (5M) dropped from 275M to 42M. Maybe recent versions
>  of Python are smarter now about string concatenation in loops? Or maybe
>  the change itself is a red herring and we're seeing a symptom of a
>  reference counting bug, say?

The behaviour of join() has been pretty much the same for a very long
time.  It just does:
 1. iterate over the list, checking that each item is a string and
summing up the lengths.
 2. allocate a string large enough to hold the result
 3. iterate over the list again copying each item into the result string.
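The two-pass approach can be sketched in pure Python (using a bytearray to stand in for the single pre-sized allocation that CPython's join() makes internally; the function name is mine, not from bzr):

```python
def join_sketch(items):
    # Pass 1: check that each item is a byte string and sum the lengths.
    total = 0
    for item in items:
        if not isinstance(item, bytes):
            raise TypeError("sequence item must be bytes")
        total += len(item)
    # Pass 2: one allocation of exactly the final size, then copy each
    # item into place.  No intermediate strings are ever created.
    buf = bytearray(total)
    pos = 0
    for item in items:
        buf[pos:pos + len(item)] = item
        pos += len(item)
    return bytes(buf)
```

Each input byte is copied exactly once, so both time and peak memory are linear in the total data size.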

In contrast, each "line_bytes += line" iteration in your version does:
 1. allocate a new string of length len(line_bytes) + len(line)
 2. copy line_bytes and line into the new string
 3. free the old line_bytes.
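Roughly, each iteration behaves like this sketch (ignoring the in-place realloc optimization that recent CPython versions sometimes apply when the string's reference count is 1; the helper name is illustrative):

```python
def concat_step(line_bytes, line):
    # Step 1: allocate a fresh buffer big enough for both pieces.
    new = bytearray(len(line_bytes) + len(line))
    # Step 2: copy the entire accumulated string, then the new piece.
    new[:len(line_bytes)] = line_bytes
    new[len(line_bytes):] = line
    # Step 3: the old line_bytes is freed by refcounting once the
    # caller rebinds the name to the returned string.
    return bytes(new)
```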

As you can see, this gives quadratic behaviour with the size of the
list rather than linear.  And in the last iteration of the loop, it
will temporarily be holding on to two copies of "line_bytes" (one
without the final line).  This is in addition to the original lines
list which is also in memory.
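The difference is easy to see by counting bytes copied rather than timing anything.  The two models below (my names, and again ignoring CPython's occasional in-place resize optimization) take a list of chunk lengths and return the total number of bytes each strategy copies:

```python
def bytes_copied_concat(lengths):
    # Model of the += loop: every iteration copies the whole accumulated
    # prefix plus the new chunk into a fresh string.
    copied = 0
    accumulated = 0
    for n in lengths:
        copied += accumulated + n
        accumulated += n
    return copied

def bytes_copied_join(lengths):
    # Model of join(): each chunk is copied exactly once into the
    # preallocated result.
    return sum(lengths)
```

For 1000 chunks of 10 bytes each, join() copies 10,000 bytes while the loop copies just over 5 million, i.e. roughly total**2 / (2 * chunk_size).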

If you are seeing reduced memory usage, John's suggestion about adding
a gc.collect() call would be worth trying.  The for loop version
executes a lot more bytecode so probably ends up triggering GC at some
point.  That is less likely to happen with a single function call.
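A minimal way to test that hypothesis (the function name is mine; note that plain refcounted strings are freed immediately anyway, so gc.collect() only matters if cyclic garbage is keeping memory alive):

```python
import gc

def join_and_collect(lines):
    # Build the result with a single join, then force a full collection
    # so any cyclic garbage is reclaimed now rather than at whatever
    # later point the allocation thresholds would have triggered GC.
    result = b"".join(lines)
    gc.collect()  # returns the number of unreachable objects found
    return result
```

If memory usage with an explicit gc.collect() matches the for-loop version, the difference was just GC timing, not the concatenation strategy.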

>  > If this change prevents MemoryErrors then something weird is going on.
>
>  In the case of bzr-fastimport, the code was calling read in a loop and
>  joining that list. In the case of bzr itself, I'm pretty sure we
>  arbitrarily partition huge binary files into "lines" based on where '\n'
>  characters just happened to be. Given this is user data, that could be
>  almost any size. (In Guido's case, I think his largest file is 40M FWIW.)

As mentioned above, I'd expect the peak memory usage for the for-loop
version to be quite a bit higher than the join() version, so I don't
think line lengths have much to do with it.

James.
