Rethinking intern() for python

James Henstridge james at jamesh.id.au
Wed Apr 8 11:01:55 BST 2009


On Wed, Apr 8, 2009 at 5:49 PM, James Henstridge <james at jamesh.id.au> wrote:
> On Wed, Apr 8, 2009 at 12:42 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>> John Arbash Meinel writes:
>>
>>  > The current string definition is:
>>  > typedef struct {
>>  >     Py_ssize_t ob_refcnt;    /* 32/64-bit counter */
>>  >     PyTypeObject *ob_type;   /* 32/64-bit pointer */
>>  >     Py_ssize_t ob_size;      /* 32/64-bit counter */
>>  >     long ob_shash;           /* 32/64-bit hash */
>>  >     int ob_sstate;           /* 32-bit state */
>>  >     char ob_sval[1];         /* 8-bit space for the NUL terminator */
>>  > } PyStringObject;
>>  >
>>  > ob_sstate takes up 4 bytes just to store one of the values 0, 1, or 2,
>>  > which indicate the INTERN state of the string. That is 3 bytes of direct
>>  > waste.
>>  >
>>  > Also, at least with my compiler, the 1 byte for ob_sval[] actually
>>  > makes sizeof(PyStringObject) grow by 4 bytes, which means that
>>  > every malloc() over-allocates *another* 3 bytes.
>>
>> I don't think so.  This looks like the usual C idiom for a variable-
>> sized data area, which works because C doesn't use the "1" for
>> anything except determining the size of a statically-allocated
>> PyStringObject.  So with the exception of the null string (of which
>> there will be exactly one in your table), you're going to need to
>> allocate that space anyway, and you can't move it, it has to be the
>> last member of the struct.  If the compiler wastes space by allocating
>> to alignment boundaries, it wastes it, but in general you can't do
>> anything about it.  Or am I missing something?
>
> So on my system (64-bit Linux, Python 2.5), sizeof(PyStringObject) is
> 40.  The members of the structure as defined add up to 37 bytes, but
> there are 3 pad bytes.
>
> When allocating string objects, it allocates "sizeof(PyStringObject) +
> size" bytes.  So allocating a 3 byte string will ask malloc for 43
> bytes, even though only 40 bytes are needed (this includes a byte for
> the null termination).  So it always wastes space even when padding
> isn't needed.
>
> This particular problem could be avoided by allocating
> "offsetof(PyStringObject, ob_sval) + 1 + size" bytes instead.

... and it appears that they've made this very fix for Python 2.7 and 3.1.
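
For anyone who wants to check the numbers on their own machine, here is a
minimal sketch that compares the two allocation formulas from the quoted
message. DemoStringObject is a hypothetical stand-in for PyStringObject
(with ptrdiff_t assumed in place of Py_ssize_t and void * in place of
PyTypeObject *), and the helper names are made up for illustration; none
of this is taken from the CPython source. On a typical 64-bit Linux build
I would expect it to show the 43 versus 40 bytes described above for a
3-byte string, but the exact figures depend on the compiler and ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical mirror of the quoted struct layout (an assumption, not
   the real CPython definition). */
typedef struct {
    ptrdiff_t ob_refcnt;   /* stand-in for Py_ssize_t */
    void *ob_type;         /* stand-in for PyTypeObject * */
    ptrdiff_t ob_size;     /* stand-in for Py_ssize_t */
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];       /* first byte of the variable-sized data */
} DemoStringObject;

/* Bytes requested with the "sizeof(struct) + size" formula. */
size_t padded_alloc_size(size_t len)
{
    return sizeof(DemoStringObject) + len;
}

/* Bytes requested with the "offsetof(struct, ob_sval) + 1 + size"
   formula; the extra 1 is the NUL terminator. */
size_t tight_alloc_size(size_t len)
{
    return offsetof(DemoStringObject, ob_sval) + 1 + len;
}

/* Print both request sizes for a string of the given length. */
void print_alloc_sizes(size_t len)
{
    /* The tight formula can never ask for more than the padded one,
       because sizeof covers at least the ob_sval[1] member. */
    assert(tight_alloc_size(len) <= padded_alloc_size(len));
    printf("len=%zu padded=%zu tight=%zu\n",
           len, padded_alloc_size(len), tight_alloc_size(len));
}
```

Calling print_alloc_sizes(3) from a main() of your own prints both request
sizes; the difference between them is the tail padding plus the one
ob_sval byte that sizeof already counts.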

James.