weakrefs and avoiding bad gc cycles

Martin Pool mbp at canonical.com
Wed Jun 1 02:42:49 UTC 2011


The issue comes up in
<https://code.launchpad.net/~vila/bzr/config-lock-remote/+merge/62418>
of avoiding reference cycles between objects, and also for example
<https://code.launchpad.net/~gz/bzr/cleanup_testcases_by_collection_613247/+merge/32711>.
 I'd like to get clear on the general rule we've learned from previous
bugs and maybe put it in the developer guidelines, so that we can
always do things using the best practices.  This is just a straw man;
please correct me.

spiv said on irc he feels like he missed the memo about why this was
being done; so do I; this thread hopefully will be it.

There are two related problems:
 - Holding references to large objects can make them stay in memory
longer than they should; so trivially you should not have long-lived
objects holding unnecessary references to otherwise short-lived
objects.
 - Although python can gc objects held in reference cycles, it does
not seem to always do this very well, and avoiding the cycles can
allow objects to be released much faster.  So the cycle is a problem
even when all the objects ought to have the same lifetime.
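To make the second problem concrete, here is a minimal sketch (plain Python, no bzrlib) showing that an object caught in a cycle is not freed by reference counting when its last name goes away; it has to wait for a run of the cyclic collector.  (And on the Python 2 series, a cycle containing objects with __del__ methods is never collected at all; it ends up in gc.garbage.)

```python
import gc
import weakref

class Node(object):
    pass

# Build a two-object reference cycle.
a = Node()
b = Node()
a.other = b
b.other = a
probe = weakref.ref(a)

# Dropping the names does not free the objects: each still has a
# refcount of 1 via the cycle, so plain reference counting cannot
# reclaim them.
del a, b
assert probe() is not None   # still alive after the last name is gone

# Only the cyclic collector can break the cycle.
gc.collect()
assert probe() is None
```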

I'd like to know how bad people think the second one is.  In that
review, Vincent seems to imply that we should always preemptively
avoid cycles.  If we have to do something approaching manual memory
management in Python that's pretty disappointing, and I almost cannot
believe it's actually that bad, or at least I'd like to see some
evidence Python's handling of cycles is so bad that we have to always
code around it.

Possible approaches:
1 - Have only what you could call "conceptually downwards" links
between objects, so that cycles don't occur: for instance, let the
Branch know about its configuration but not vice versa.

Sometimes thinking about this constraint actually gives a cleaner
factoring with fewer dependencies between objects.  However, sometimes
it is difficult to make this change.  The specific thing Vincent has
here is that the branch configuration is guarded on disk by the
branch's lock.  (I suppose you could make a refactoring where they
both share a lock object which does not have upward links.)
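That refactoring might look something like this (the class names here are illustrative, not the real bzrlib ones): the branch and its configuration share a lock object, and every reference points "downwards" -- Branch -> Config, Branch -> Lock, Config -> Lock -- so no cycle can form.

```python
class Lock(object):
    """A shared lock with no upward links to its owners."""
    def __init__(self):
        self.held = False
    def lock_write(self):
        self.held = True
    def unlock(self):
        self.held = False

class Config(object):
    def __init__(self, lock):
        # The config knows about the shared lock, but not the branch.
        self._lock = lock
    def set_option(self, name, value):
        if not self._lock.held:
            raise RuntimeError('branch must be write-locked')
        # ... write the option to disk here ...

class Branch(object):
    def __init__(self):
        self._lock = Lock()
        self.config = Config(self._lock)   # downward link only
```

The config can still enforce "only write while the branch is locked" without ever holding a reference back up to the branch.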

2 - Have a 'close' method on important objects, that deletes references
to objects they hold, therefore giving the gc a bit of a hand.
Callers that don't explicitly close will normally be fine, they'll
just rely on the gc doing a decent job.  For some objects that are
locked, we could tie into that and release subsidiaries (such as their
configuration) when they are unlocked, which is normally taken to mean
that they should release their caches.
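A sketch of what that could look like, with made-up names: unlock() doubles as the "close", dropping the subsidiary references so the gc has less work to do even if a cycle exists somewhere.

```python
class Branch(object):
    def __init__(self):
        self._config = None
        self._cache = {}

    def get_config(self):
        # Lazily create and cache the config object.
        if self._config is None:
            self._config = object()  # stand-in for a real config object
        return self._config

    def unlock(self):
        # Releasing the lock is already understood to invalidate
        # caches, so drop the subsidiary objects here too, giving the
        # gc a bit of a hand.
        self._config = None
        self._cache.clear()
```

Callers that never call unlock() still work; they just fall back on the gc doing a decent job.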

3 - Manually use weakrefs, as Vincent's code does.  This is a bit
different from how we've used what you could call "semantic" weakrefs
in the past, where the code will use an object if it's present but
doesn't care if it's retained.  In this setup, the object _must_ be
present, and it really should be a real reference, but we're using a
weakref as a workaround for a Python bug.  This seems a bit
convoluted.
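For concreteness, the pattern is roughly this (again with illustrative names): the branch holds a strong downward reference to its config, and the config holds only a weak reference back up, so the Branch <-> Config cycle never forms.

```python
import weakref

class Config(object):
    def __init__(self, branch):
        # Weak upward link: does not keep the branch alive.
        self._branch_ref = weakref.ref(branch)

    def _get_branch(self):
        branch = self._branch_ref()
        if branch is None:
            # The cost of the workaround: the branch is *supposed* to
            # outlive the config, so a dead weakref here is a
            # programming error, not a normal condition.
            raise AssertionError('branch vanished while config in use')
        return branch

class Branch(object):
    def __init__(self):
        self.config = Config(self)   # strong downward, weak upward
```

Note that if some caller keeps the config but drops the branch, every access through _get_branch() becomes a crash -- which is exactly the "expired weakref" downside below.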

I can see how 3 is a useful tool to have if the gc deals really poorly
with cycles.  However it seems to have some big downsides: access
through this reference will likely be substantially slower; we may
have crashes where the weakrefs has expired; it complicates the code;
it makes it unclear that the object is expected to always be there.

Maybe there is a 4?

From what I know so far, the rules I would try to follow are:

 * think about object lifetime; don't hold things unnecessarily
 * avoid class relationships that have reference cycles (both for the
sake of gc performance and general cleanliness)
 * for objects that hold large resources (especially external
resources) think about having a way to explicitly release them; and
think about deleting in-memory references when you do so
 * don't complicate the code to work around python bugs unless you
have actual evidence the complication improves things

.. edits welcome.

Martin



More information about the bazaar mailing list