weakrefs and avoiding bad gc cycles

John Arbash Meinel john at arbash-meinel.com
Wed Jun 1 13:23:42 UTC 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/1/2011 2:02 PM, Barry Warsaw wrote:
> I don't have a whole lot to add to this thread, except for one experience I
> had in a application I worked on in a previous life.  This was a potentially
> very long running app (but not a daemon) that created and freed tons of
> objects.  None were huge but lots of cycles were created and __del__ was not a
> problem.
> 
> The issue we hit was the time spent by Python in essentially gc.collect().  It
> turned out that a huge number of our objects were ending up in generation-2
> before the cycles they participated in were collectable.  Late in the run,
> Python would spend a huge amount of time just iterating over generation-2
> objects.  This is actually why I added the optional `generation` argument in
> gc.collect() back in the Python 2.5 time frame.
> 
> This may have no relevance to bzr, but it is useful to remember that while
> most programs can safely ignore the costs of cyclic gc, it can come back to
> bite you in some cases.
> 
> -Barry

Yeah, we've already programmed around a bunch of them. Specifically see
'bzrlib/_static_tuple_c.c'. We created an object which is tuple-like,
but doesn't allow arbitrary content, and thus cannot participate in
arbitrary reference cycles, and thus doesn't participate in the python
garbage collector (if it were in a cycle, it would live forever).

It knows it isn't in a cycle because it only allows other StaticTuple
objects, or string/unicode/int/bool/None. I believe Python 2.7 has at
least some support for noticing that a tuple only points to other
tuple/string/int/etc and when the gc walks and detects it is safe, it
will pull the given tuple out of the gc pointers.
On the flip side, python still allocates 16+ bytes for the GC header.
While StaticTuple doesn't have the GC overhead.
 sys.getsizeof(()) = 28 bytes on 32-bit, and ~56 bytes on 64-bit
 sys.getsizeof(StaticTuple()) = 12 bytes on 32-bit, 24bytes on 64-bit.

I wrote it as a C extension, though I believe newer version of Pyrex and
Cython allow you to 'cdef class ... [nogc]' to explicitly declare that
instances should not participate in the garbage collector, and should
not allocate the extra 16 bytes to it.

All that to say, if you are creating lots of objects, you're absolutely
better doing whatever you can to keep them from both participating in
reference cycles, and to completely remove them from the garbage
collector if you can.

This was the NEWS entry:
... a new StaticTuple datatype, allowing us to reduce
memory consumption (50%) and garbage collector overhead (40% faster) for
many operations.

IIRC, that was peak memory and time spent to do "bzr branch launchpad".

We also did something similar in bzrlib/_btree_serializer_pyx.pyx. While
the 'group' object participates in GC, the individual records are all
plain 'struct' objects that are managed in a simple array (don't
participate in GC). Though that also let us do stuff like store a sha1
hash as a 20 byte char array, rather than storing it as a 40-byte hex
string + 24/48bytes of PyString overhead. (A single record is stored as
40 bytes, which is smaller than just the string, not to mention 3 PyInt
and a tuple to bind it all together.)


All that to say, playing games with python's gc makes a *huge*
difference for bzr.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk3mPV4ACgkQJdeBCYSNAAMSrgCePtPMssTJwwafvWCz77218aeS
BwsAn1pBqBw1+Wr+5bA4XQiU9+aW2FwR
=/jSp
-----END PGP SIGNATURE-----



More information about the bazaar mailing list