Merge eats memory

Andrew Bennetts andrew at canonical.com
Fri May 27 12:51:00 BST 2005


On Thu, May 26, 2005 at 10:54:37PM -0400, Aaron Bentley wrote:
[...]
> > Casual observation of top showed bzr using up to 700M of memory.
> 
> That's a rather shocking 777:1 ratio.  I'd have expected much closer to
> a 1:1 ratio, because it does store a changeset in memory.
> 
> There are several places where I read an entire file into memory.  If
> that's the cause, it'll probably be this line:
> 
> changeset.py:1444
>             if file(full_path_a, "rb").read() == \
>                 file(full_path_b, "rb").read():
> 
> But as you can see, no reference to the file contents is retained, so it
> could only be the cause if the heap allocator really stinks.  (Yes that
> code is sloppy, but it's clear, too.  Easy to change if it's a problem.)

CPython will immediately deallocate an object when its refcount hits zero,
so the code you have there is fine (aside from holding the entire files in
memory, of course :).

(It's only when there are reference cycles that deallocation is delayed,
because objects in a cycle are only reclaimed when the cyclic garbage
collector runs.)
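
Here's a tiny illustration of the difference (not bzr code, just a sketch;
weakref lets us observe when an object actually goes away):

    import gc
    import weakref

    class Node(object):
        pass

    # No cycle: CPython frees the object the moment its refcount hits zero.
    n = Node()
    r = weakref.ref(n)
    n = None
    print(r())   # None: already deallocated

    # A cycle keeps both refcounts above zero after the names are gone, so
    # the objects survive until the cyclic garbage collector runs.
    a, b = Node(), Node()
    a.other, b.other = b, a
    ra = weakref.ref(a)
    a = b = None
    print(ra())  # still alive: the refcounts never reached zero
    gc.collect() # the cycle collector reclaims them
    print(ra())  # None: collected now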

I'd look elsewhere.  Here's some code that may help:
    http://twistedmatrix.com/users/spiv/countrefs.py

Simplest usage is to call the "logInThread" method, which will periodically
dump the N most referenced types and classes, with their ref counts, to a
file.  This is useful information because every instance in Python holds a
reference to its class/type, so roughly speaking the ref count of a
class/type is the number of instances of that class/type (certainly for the
most referenced ones; the handful of extra references in e.g. module dicts
are insignificant).  It's also about the best you can do easily, because
Python doesn't give you any way to find out the size of an object.
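
I won't paste the script here, but the underlying trick is roughly this
(a sketch only; top_referenced_types is a name I've made up, and the real
script does its counting in a background thread):

    import gc
    import sys

    def top_referenced_types(n=20):
        # The ref count of a type object roughly equals its number of live
        # instances; the few stray references (module dicts, our locals)
        # are noise for the heavily-referenced types.
        counts = {}
        for obj in gc.get_objects():
            t = type(obj)
            if t not in counts:
                # getrefcount reports one extra for its own argument.
                counts[t] = sys.getrefcount(t) - 1
        # Note: gc.get_objects() only enumerates container objects, so
        # types whose instances are never gc-tracked (e.g. str) may be
        # missed by this sketch.
        return sorted(counts.items(), key=lambda item: item[1],
                      reverse=True)[:n]

    for t, refs in top_referenced_types(10):
        print('%8d %s' % (refs, t))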

So countrefs.py won't help you track down a single 400MB string, but it
will help you realise you're keeping a list of 400,000 objects that you
didn't mean to keep.

[...]
> Anyhow, it would be useful to know whether the file comparisons are the
> cause of the problem.

At a glance, it seems unlikely.

Just in case, here's a function to compare files a little bit at a time:

def sameFile(path1, path2):
    file1 = open(path1, 'rb')
    try:
        file2 = open(path2, 'rb')
        try:
            # Two-argument iter() calls the lambda until it returns the
            # ('', '') sentinel, i.e. until both files hit EOF together.
            chunksIter = iter(lambda: (file1.read(4096), file2.read(4096)),
                              ('', ''))
            for chunk1, chunk2 in chunksIter:
                if chunk1 != chunk2:
                    return False
            return True
        finally:
            file2.close()
    finally:
        file1.close()
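
With that, the comparison at changeset.py:1444 would just become
"if sameFile(full_path_a, full_path_b):", and merge would hold at most 8KB
of file data at a time instead of both files.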

-Andrew.