[RFC] PyInternStringSet

John Arbash Meinel john at arbash-meinel.com
Sat Apr 4 23:11:30 BST 2009


Over the past few days, Michael Hudson and I have been working on a
python memory analysis tool. In doing so, I've gotten to learn a lot
about the internal details of python objects.

While looking over this, I discovered part of why "intern()" in Python
is not as cheap as one might expect (compared with other languages,
etc).

Specifically, intern() puts the string into a dict (with both the key
and the value pointing to the string). A dict entry is actually 3
platform-sized words (32/64-bit): a pointer to the key, a pointer to
the value, and the cached hash of the key. Further, because the dict is
a hash table, it prefers a load factor of 1:4 (so for 50 entries, it
wants a hash table of 200 slots). Only once you get past 50k items does
it switch to a load factor of 1:2.

This means, though, that before you get to 50k entries in the intern()
dict, you end up having 4*3=12 "words" per key. For 32-bit, this is 48
bytes of memory per key.
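To make that arithmetic concrete, here is a tiny sketch (the constants
are my own, matching the 32-bit numbers above):

```python
WORD = 4  # bytes per pointer-sized word on 32-bit


def intern_dict_overhead(slots_per_entry=4):
    """Bytes of overhead per interned string in CPython's intern() dict.

    Each dict entry holds 3 words (key pointer, value pointer, cached
    hash), and at a 1:4 load factor each live entry accounts for 4
    table slots.  Past 50k items the load factor drops to 1:2, so
    pass slots_per_entry=2 for that case.
    """
    words_per_entry = 3
    return slots_per_entry * words_per_entry * WORD


# 4 slots * 3 words * 4 bytes = 48 bytes per key on 32-bit,
# falling to 24 bytes per key once the 1:2 load factor kicks in.
```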

Now, a python string has 6 words of overhead. This breaks down as:
  1) refcount
  2) type pointer
  3) length
  4) cached hash
  5) 'state'
  6) char ob_sval[1] (I think this throws off the counting slightly,
     because this is actually only 1 byte, which is shared with the
     real string. However python allocates an extra 3 bytes for
     this.)

It is sort of a shame that python uses a whole 'int' for state, given it
can only have 3 values, and the space could be shared with the string
data that follows.

Anyway, going back to the intern() overhead. To use real numbers, the
size of a PyString on 32-bit is 24+length, and the overhead of putting
it into intern() is 48 bytes. (On 64-bit this goes to 48+length and 96,
respectively.)

So this biases the decision when considering the likelihood that the
string will be used more than once. For short strings (<24 bytes), you
actually lose memory if they exist only 2 times.
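To put numbers on that break-even point, a quick back-of-the-envelope
helper (hypothetical, using the 32-bit constants from above):

```python
STRING_OVERHEAD = 24   # PyString header bytes on 32-bit
INTERN_OVERHEAD = 48   # intern() dict bytes per key (1:4 load factor)


def interning_saves_memory(length, occurrences):
    """True if interning a `length`-byte string that occurs
    `occurrences` times is a net memory win.

    Without interning we pay for every copy; with interning we pay
    for a single copy plus the intern() dict entry.
    """
    without = occurrences * (STRING_OVERHEAD + length)
    with_intern = (STRING_OVERHEAD + length) + INTERN_OVERHEAD
    return with_intern < without


# A 10-byte string seen twice: 2*34 = 68 vs 34+48 = 82 -- interning loses.
# The same string seen four times: 136 vs 82 -- interning wins.
```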

However, you could easily write a different structure. Strings already
cache their own hash, and you don't need separate 'key' and 'value'
pointers, so you could get away with a single pointer per entry instead
of 3. And as strings should hash fairly well, I think a load factor of
2:1 would be reasonable.

That would mean you could shrink the intern() structure from 48 bytes
per entry to 8 on 32-bit (from 96 to 16 on 64-bit). You could also use
hash buckets instead of the extra load factor, except the smallest
bucket would be 2 pointers, which is equivalent to a load factor of 2
anyway.
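To illustrate the single-pointer design, here is a rough Python sketch
of the probing logic (the class name and sizes are my own; a real
implementation would be C, storing one PyString* per table slot):

```python
class InternStringSet:
    """Open-addressed set keeping one reference per slot, resized so
    the table stays at or below half full (a 2:1 load factor)."""

    def __init__(self):
        self._slots = [None] * 8
        self._used = 0

    def _probe(self, s, slots):
        # Linear probing; CPython's dict uses a smarter perturb scheme.
        i = hash(s) % len(slots)
        while slots[i] is not None and slots[i] != s:
            i = (i + 1) % len(slots)
        return i

    def intern(self, s):
        """Return the canonical copy of s, inserting it if new."""
        i = self._probe(s, self._slots)
        if self._slots[i] is None:
            self._slots[i] = s
            self._used += 1
            if self._used * 2 > len(self._slots):
                self._resize()
                i = self._probe(s, self._slots)
        return self._slots[i]

    def _resize(self):
        # Double the table and rehash every live entry into it.
        new = [None] * (len(self._slots) * 2)
        for s in self._slots:
            if s is not None:
                new[self._probe(s, new)] = s
        self._slots = new
```

Two equal strings built separately come back as the same object, which
is the whole point of interning.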

It would be best if we could get this as part of python core, so that
you can intern() a string and still have it deallocated (like we have
now with strings). However, we could do something like this in bzr and
just have it as a FIFO/LRU sort of cache.
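Failing that, a bzr-level cache could look something like this minimal
FIFO sketch (entirely hypothetical, not actual bzrlib code):

```python
class FIFOInternCache:
    """Deduplicate strings, evicting the oldest entry once the cache
    exceeds max_size (a stand-in for real LRU bookkeeping)."""

    def __init__(self, max_size=10000):
        self._max_size = max_size
        self._cache = {}
        self._order = []  # insertion order, for FIFO eviction

    def intern(self, s):
        """Return the cached copy of s, inserting it if new."""
        cached = self._cache.get(s)
        if cached is not None:
            return cached
        if len(self._cache) >= self._max_size:
            oldest = self._order.pop(0)
            del self._cache[oldest]
        self._cache[s] = s
        self._order.append(s)
        return s
```

Unlike intern() in python core, entries here eventually fall out of the
cache, so the strings can be deallocated once nothing else holds them.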

I would like to propose this to python, but I'm not sure the best place.
Add something to their issue tracker? Post it to python-dev? I would
guess it would be better received if someone more active (Andrew, or
Michael?) posted it.

What do people think?

John
=:->



More information about the bazaar mailing list