weakrefs and avoiding bad gc cycles

Robert Collins robertc at robertcollins.net
Wed Jun 1 03:32:46 UTC 2011


On Wed, Jun 1, 2011 at 2:42 PM, Martin Pool <mbp at canonical.com> wrote:

> There are two related problems:
>  - Holding references to large objects can make them stay in memory
> longer than they should; so trivially you should not have long-lived
> objects holding unnecessary references to otherwise short-lived
> objects.

Yes, this is clearly important.

>  - Although python can gc objects held in reference cycles, it does
> not seem to always do this very well, and avoiding the cycles can
> allow objects to be released much faster.  So the cycle is a problem
> even when all the objects ought to have the same lifetime.

I'm not aware of any Python implementation with significant issues
here. fepy, jython and pypy all have solid gc implementations. CPython
runs its cycle collector (IIRC) once container-object allocations
exceed deallocations by a threshold, see gc.get_threshold().

So nothing will leave cycles around indefinitely with *one* exception - __del__.
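
A minimal sketch of that claim, assuming CPython or another
implementation with a working gc module. Node and cycle_is_collected
are made-up names for illustration only. It shows that a cycle survives
dropping the last outside reference, and goes away once the collector
runs:

```python
import gc
import weakref


class Node(object):
    """Hypothetical class used only to build a reference cycle."""

    def __init__(self):
        self.other = None


def cycle_is_collected():
    gc.disable()  # stop automatic collection so the timing is under our control
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # a <-> b now form a cycle
        probe = weakref.ref(a)   # observes a without keeping it alive
        del a, b
        alive_before = probe() is not None  # no collection has run yet
        gc.collect()
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        gc.enable()
```

The gap between "unreachable" and "actually freed" is the window the
rest of this mail is about.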

That said, for some things even a fraction of a second will matter.
One such case is when you have a file open in a directory that is
going to be deleted - that file -must- be closed before the directory
cleanup runs, or the cleanup will fail on Windows. Other cases exist
around threads and sockets. AFAIK all the cases where this matters
involve OS resources.
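
A rough sketch of the file-in-a-directory case, using only the
standard library. The function name and the file name are invented for
the example. The point is that close() is called explicitly before the
directory is removed, rather than leaving it to refcounting or the gc:

```python
import os
import shutil
import tempfile


def write_then_remove(text):
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "scratch.txt")  # hypothetical file name
    f = open(path, "w")
    try:
        f.write(text)
    finally:
        f.close()  # explicit: do not wait for refcounting or the gc
    # On Windows, removing the directory fails while the file is still open.
    shutil.rmtree(workdir)
    return os.path.exists(workdir)
```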

> Possible approaches:
> 1 - Have only what you could call "conceptually downwards" links
> between objects, so that cycles don't occur: for instance, let the
> Branch know about its configuration but not vice versa.
>
> Sometimes thinking about this constraint actually gives a cleaner
> factoring with fewer dependencies between objects.  However, sometimes
> it is difficult to make this change.  The specific thing Vincent has
> here is that the branch configuration is guarded on disk by the
> branch's lock.  (I suppose you could make a refactoring where they
> both share a lock object which does not have upward links.)

This doesn't guarantee free-order on non-refcount implementations of
Python. So it's insufficient if a free is necessary before some other
action takes place.

> 2- Have a 'close' method on important objects, that deletes references
> to objects they hold, therefore giving the gc a bit of a hand.
> Callers that don't explicitly close will normally be fine, they'll
> just rely on the gc doing a decent job.  For some objects that are
> locked, we could tie into that and release subsidiaries (such as their
> configuration) when they are unlocked, which is normally taken to mean
> that they should release their caches.

> 3- Manually use weakrefs, as Vincent's code does.  This is a bit
> different from how we've used what you could call "semantic" weakrefs
> in the past, where the code will use an object if it's present but
> doesn't care if it's retained.  In this setup, the object _must_ be
> present, and it really should be a real reference, but we're using a
> weakref as a workaround for a Python bug.  This seems a bit
> convoluted.

> I can see how 3 is a useful tool to have if the gc deals really poorly
> with cycles.  However it seems to have some big downsides: access
> through this reference will likely be substantially slower; we may
> have crashes where the weakref has expired; it complicates the code;
> it makes it unclear that the object is expected to always be there.
>
> Maybe there is a 4?
>
> From what I know so far, the rules I would try to follow are:
>
>  * think about object lifetime; don't hold things unnecessarily
>  * avoid class relationships that have reference cycles (both for the
> sake of gc performance and general cleanliness)
>  * for objects that hold large resources (especially external
> resources) think about having a way to explicitly release them; and
> think about deleting in-memory references when you do so
>  * don't complicate the code to work around python bugs unless you
> have actual evidence the complication improves things

Broadly +1.

I think a pithy statement of the issue (ignoring __del__ which this
isn't about AFAICT) is:
 - Python offers no guarantee on either timeliness or ordering of
freeing of resources.
 - So if we need to guarantee free/close/whatever before some other
operation takes place then we have to manually arrange for that to
take place.
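
One way to arrange that manually, as a sketch only: give the object a
close() that drops what it holds, and call it at a known point.
GroupCache and peak_and_final_size are hypothetical names, not
anything in bzrlib. contextlib.closing is standard library and just
calls close() on exit from the with block:

```python
from contextlib import closing


class GroupCache(object):
    """Hypothetical holder of large in-memory texts."""

    def __init__(self):
        self._texts = {}
        self.closed = False

    def add(self, key, text):
        if self.closed:
            raise ValueError("cache is closed")
        self._texts[key] = text

    def size(self):
        return sum(len(t) for t in self._texts.values())

    def close(self):
        # Drop the references now, so the texts can be freed at this point
        # rather than whenever the cache object itself is collected.
        self._texts.clear()
        self.closed = True


def peak_and_final_size():
    with closing(GroupCache()) as cache:
        cache.add("a", "x" * 1000)
        peak = cache.size()
    # close() has run by here, whatever the gc is doing.
    return peak, cache.size(), cache.closed
```

Callers that never call close() still work; they just fall back to
whatever timing the implementation's gc gives them.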

And some applications of it for us are:
 - to avoid memory spikes we have to manually manage the objects which
hold file texts / compressed groups etc.
 - to avoid test suite problems on windows we need to coordinate
server thread closing of files
 - to avoid open file handle issues during disk operations we need to
manually close files

I think using weakrefs is diametrically opposite to what is needed -
not to mention likely slower.
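
To make the weakref objection concrete, here is a small sketch. Branch
and Config below are hypothetical stand-ins, not the real bzrlib
classes. Every access pays for an extra call to dereference the
weakref, and the holder has to cope with the referent having gone away
even though the design says it must always be there:

```python
import gc
import weakref


class Branch(object):
    """Hypothetical stand-in; not the real bzrlib class."""
    name = "trunk"


class Config(object):
    """Hypothetical stand-in holding a weak back-reference to its branch."""

    def __init__(self, branch):
        self._branch_ref = weakref.ref(branch)

    def branch_name(self):
        branch = self._branch_ref()  # extra call on every access
        if branch is None:
            raise ReferenceError("branch has already been freed")
        return branch.name


def demo():
    branch = Branch()
    config = Config(branch)
    first = config.branch_name()
    del branch
    gc.collect()  # make the outcome independent of refcounting
    try:
        config.branch_name()
    except ReferenceError:
        return first, "expired"
    return first, "still alive"
```

Nothing here releases the branch any earlier than an explicit close
would; it only changes who notices when it is gone.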

-Rob


