Rethinking intern() for python

James Henstridge james at jamesh.id.au
Wed Apr 8 10:49:43 BST 2009


On Wed, Apr 8, 2009 at 12:42 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> John Arbash Meinel writes:
>
>  > The current string definition is:
>  > typedef struct {
>  >     Py_ssize_t ob_refcnt;    /* 32/64bit counter */
>  >     PyTypeObject *ob_type;   /* 32/64bit pointer */
>  >     Py_ssize_t ob_size;      /* 32/64bit counter */
>  >     long ob_shash;           /* 32/64bit hash */
>  >     int ob_sstate;           /* 32bit state */
>  >     char ob_sval[1];         /* 8bit space for NULL */
>  > } PyStringObject;
>  >
>  > ob_sstate is taking up 4-bytes just to store one of the values 0, 1, 2,
>  > to indicate the INTERN state of the string. Which is 3-bytes of direct
>  > waste.
>  >
>  > Also, at least with my compiler, the 1 byte for ob_sval[] actually
>  > causes the sizeof(PyStringObject) to get 4 more bytes. Which means that
>  > every malloc() is over-allocating *another* 3 bytes.
>
> I don't think so.  This looks like the usual C idiom for a variable-
> sized data area, which works because C doesn't use the "1" for
> anything except determining the size of a statically-allocated
> PyStringObject.  So with the exception of the null string (of which
> there will be exactly one in your table), you're going to need to
> allocate that space anyway, and you can't move it, it has to be the
> last member of the struct.  If the compiler wastes space by allocating
> to alignment boundaries, it wastes it, but in general you can't do
> anything about it.  Or am I missing something?

On my system (64-bit Linux, Python 2.5), sizeof(PyStringObject) is
40.  The fields as defined occupy 37 bytes (four 8-byte fields, a
4-byte int, and 1 byte for ob_sval), so the compiler adds 3 bytes of
trailing padding.

When allocating a string object, Python requests "sizeof(PyStringObject)
+ size" bytes.  So a 3 byte string asks malloc for 43 bytes, even
though only 40 are needed (36 bytes of header, 3 bytes of data, and a
byte for the null termination).  The pad bytes are therefore
over-allocated for every string, even when the string data could have
occupied them.

This particular problem could be avoided by allocating
"offsetof(PyStringObject, ob_sval) + 1 + size" bytes instead.
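
A minimal sketch of the two size calculations, to make the difference
concrete.  StringSketch is a hypothetical stand-in that mirrors the
field layout quoted above (ptrdiff_t in place of Py_ssize_t, void * in
place of PyTypeObject *); it is not the real CPython header, and the
function names are made up for illustration.  The exact numbers depend
on the platform ABI.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t, ptrdiff_t */

/* Hypothetical stand-in for the quoted PyStringObject layout. */
typedef struct {
    ptrdiff_t ob_refcnt;   /* assumed same width as Py_ssize_t */
    void *ob_type;         /* stands in for PyTypeObject * */
    ptrdiff_t ob_size;
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];       /* first byte of the string data */
} StringSketch;

/* Size requested today: whole struct (including trailing padding)
 * plus the string length. The terminator reuses ob_sval[0]'s byte. */
size_t naive_alloc_size(size_t len)
{
    return sizeof(StringSketch) + len;
}

/* Size actually needed: header up to ob_sval, the data, and one
 * byte for the terminator. */
size_t tight_alloc_size(size_t len)
{
    return offsetof(StringSketch, ob_sval) + 1 + len;
}

/* Bytes over-requested per string; equals the trailing padding. */
size_t wasted_bytes(size_t len)
{
    assert(naive_alloc_size(len) >= tight_alloc_size(len));
    return naive_alloc_size(len) - tight_alloc_size(len);
}
```

On an ABI where sizeof(StringSketch) is 40 and ob_sval sits at offset
36, as in the numbers above, naive_alloc_size(3) would give 43 and
tight_alloc_size(3) would give 40.  Whether malloc actually hands back
less memory for the smaller request depends on the allocator's own
size rounding.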

James.


