Rethinking intern() for python

John Arbash Meinel john at arbash-meinel.com
Tue Apr 7 23:25:55 BST 2009


So one of the very interesting results of my memory profiling is that
intern() is a surprisingly inefficient structure, for a number of
reasons. To give the specific values, after doing:
  bzr branch A B
of a small project, the total memory consumption is ~21MB

Of that, the largest single object is the 'interned' dict, at 1.57MB,
which contains 22k strings. Interestingly, the size of the dict plus
all the strings it references is only 2.4MB.

So the dict *by itself* takes up 65% of the memory consumed by
interning (1.57MB of the 2.4MB total). It also means that the average
size of a referenced string is 37.4 bytes. A 'str' has 24 bytes of
overhead, so the average string is 13.5 characters long.

So to save references to 13.5*22k ~ 300kB of character data, we are
paying 2.4MB, or about 8:1 overhead.
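For concreteness, the arithmetic above can be replayed (these are the
figures quoted in this post, taking MB as 10^6 bytes, not a fresh
measurement):

```python
# Sanity-check of the overhead arithmetic, using the numbers quoted above.
interned_dict = 1.57e6      # bytes for the interned dict alone
dict_plus_strings = 2.4e6   # dict + the strings it references
n_strings = 22_000
str_overhead = 24           # per-string overhead on 32-bit

avg_str = (dict_plus_strings - interned_dict) / n_strings  # ~37.7 bytes each
avg_chars = avg_str - str_overhead                         # ~13.7 characters
char_data = avg_chars * n_strings                          # ~300kB of text
ratio = dict_plus_strings / char_data                      # ~8:1 overhead
print(avg_str, avg_chars, ratio)
```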

When I looked at the actual references from interned, I saw mostly
variable names. Consider that every variable name in your code goes
through the python intern dict. And when you look at the intern
function, it doesn't use setdefault logic; it actually does a get()
followed by a set(), which means the cost of interning is 1-2 lookups
depending on hit likelihood, etc. (I saw a whole lot of strings that
are the error codes in win32all/winerror.py...)
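The two strategies, modelled with a plain Python dict (the real table
lives in C; the function names here are illustrative):

```python
# Illustrative-only model of the interning table as a Python dict.
_interned = {}

def intern_get_set(s):
    # What the post describes intern() doing today: a get() followed,
    # on a miss, by a separate set() -- two hash lookups for a new string.
    t = _interned.get(s)
    if t is None:
        _interned[s] = s
        t = s
    return t

def intern_setdefault(s):
    # Single-lookup variant: setdefault finds the slot once and either
    # returns the existing entry or stores the new one in place.
    return _interned.setdefault(s, s)
```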

Anyway, I think we can do a lot better, and I think it might make a
surprising impact on things. Here are some concrete things:


  a) Don't keep two references (key and value) to the same object
     (saves 1 pointer per entry)

  b) Don't cache the hash value in the set, as strings already cache
     their own (saves 1 long per entry). We may re-evaluate this based
     on performance, in case the locality of having the hash in the set
     makes a large difference. Given the size of this structure, making
     it smaller may be a bigger win than storing the hash. (Cutting the
     size of the array in half could give better cache coherency than
     storing the hash, especially if the number of collisions is low.)

  c) Use the existing lookup function only once. PySet->lookup()
     already does the lookup and figures out where the object would go
     if it existed in the set. intern() could almost trivially use this
     to avoid any double-lookup overhead when interning a new string.

  d) Tune stuff like default size, grow rate, load factor for the fact
     that we are dealing with a large number of mid-to-short strings. If
     their hashes are well distributed, we could probably get away with
     a load factor of 2. If we know we are likely to have lots and lots
     that never go away (you rarely *unload* modules, and all variable
     names are in the intern dict), that would suggest having a large
     initial size, and probably a wide growth factor to avoid spending a
     lot of time resizing the set.

  e) How tuned is String.hash() for the fact that most text is ASCII
     (especially true for variable names)?
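For reference on (e), this is, as best I recall, the core of CPython
2's string hash from stringobject.c, ported to Python for illustration
(the function name and the `width` parameter are my own). Nothing in
it treats ASCII specially; every byte goes through the same
multiply-and-xor step:

```python
# Python port (illustrative) of the classic CPython 2 string hash:
#   x = s[0] << 7; then x = (1000003*x) ^ c for each byte; then x ^= len(s)
# with arithmetic truncated to an unsigned C long of the given width.
def py2_string_hash(s: bytes, width: int = 64) -> int:
    mask = (1 << width) - 1
    if len(s) == 0:
        return 0
    x = (s[0] << 7) & mask
    for c in s:
        x = ((1000003 * x) ^ c) & mask
    x ^= len(s)
    if x == mask:       # the C code maps a result of -1 to -2
        x = mask - 1
    return x
```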

a & b would shrink the size of the intern() structure by 3:1, d could
cut it by another 2:1.
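For a rough sense of the container overhead alone, here is how one
might compare dict- and set-based tables with sys.getsizeof on a
current CPython. Note that dict internals have changed substantially
since this post, so the absolute numbers will not match the figures
above; this only measures the containers themselves, excluding the
strings:

```python
import sys

# Compare the footprint of a dict-based vs a set-based interning table.
n = 22_000
strings = ["name_%d" % i for i in range(n)]

as_dict = {s: s for s in strings}   # how interned works: key -> same object
as_set = set(strings)               # the proposed set-based table

print("dict:", sys.getsizeof(as_dict), "bytes")
print("set: ", sys.getsizeof(as_set), "bytes")
print("per entry:", sys.getsizeof(as_dict) / n, "vs", sys.getsizeof(as_set) / n)
```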

2) String object. This is a bit more controversial, but ATM every
string has 24 bytes of overhead (on 32-bit) and I believe 48 bytes on
64-bit platforms.

The current string definition is:
typedef struct {
    Py_ssize_t ob_refcnt;	32/64bit counter
    PyTypeObject *ob_type;	32/64bit pointer
    Py_ssize_t ob_size;		32/64bit counter
    long ob_shash;		32/64bit hash
    int ob_sstate;		32bit state
    char ob_sval[1];		8bit space for NULL
} PyStringObject;

ob_sstate is taking up 4-bytes just to store one of the values 0, 1, 2,
to indicate the INTERN state of the string. Which is 3-bytes of direct
waste.

Also, at least with my compiler, the 1 byte for ob_sval[1] actually
causes sizeof(PyStringObject) to grow by 4 more bytes, which means that
every malloc() is over-allocating *another* 3 bytes.

Now perhaps malloc() is already allocating to some higher paging size,
but at least we don't force 6-bytes of waste.

Going a bit further, I would bet that 90% of all strings are <32kB long.
So you could change the ob_size member to be a 'short', and have the
upper bit set indicate that there are extra bytes elsewhere to indicate
the total length of the string. Or you could have everything but 0xFFFF
be valid, and have that value indicate there is a different structure to
check.
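A sketch of the escape-value encoding in Python (the names, the 0xFFFF
sentinel, and the 64-bit out-of-line length are all illustrative
choices, not CPython code):

```python
import struct

SENTINEL = 0xFFFF  # "length stored out-of-line" marker

def pack_len(n: int) -> bytes:
    # Lengths below the sentinel fit inline in 16 bits.
    if n < SENTINEL:
        return struct.pack("<H", n)
    # Escape: 16-bit sentinel followed by the full 64-bit length.
    return struct.pack("<HQ", SENTINEL, n)

def unpack_len(buf: bytes) -> int:
    (n,) = struct.unpack_from("<H", buf)
    if n != SENTINEL:
        return n
    (_, n) = struct.unpack_from("<HQ", buf)
    return n
```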

If you went approximately minimal, you would end up with:
typedef struct {
    Py_ssize_t ob_refcnt;	32/64bit counter
    PyTypeObject *ob_type;	32/64bit pointer
    long ob_shash;		32/64bit hash
    short ob_size;		16bit counter
    unsigned char ob_sstate;	8bit state
    char ob_sval[];		8bit space for NULL
} PyStringObject;

Which is right at 16 bytes of overhead, down from 24. I don't really
like that ob_sval is no longer aligned on a 4-byte boundary, so we could
probably do something about that. As I understand it, malloc would align
an actual allocation to 4-byte boundary anyway...

According to my address dumps, strings in python are aligned to 8-byte
boundaries. With a breakdown of 700k @ 0x...0 and 310k @ 0x...8.
So about 2:1 aligned to a 16-byte boundary, and there doesn't seem to be
much higher-order distribution.
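A sketch of how such an address survey can be reproduced from Python
(CPython-specific: id() returns the object's address there, and the
exact counts depend on the allocator and build):

```python
from collections import Counter

# Bucket string addresses by their offset within a 16-byte window.
strings = ["x%d" % i for i in range(10_000)]
buckets = Counter(id(s) % 16 for s in strings)
print(buckets)
```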

I think fixing intern() would be worthwhile. I'm not positive about
changing PyString, though I think 64-bit platforms would see a bigger
win, since the size of pointers and many integers doubles while the
average length of a string stays the same.

John
=:->



More information about the bazaar mailing list