Rethinking intern() for python
James Henstridge
james at jamesh.id.au
Wed Apr 8 11:01:55 BST 2009
On Wed, Apr 8, 2009 at 5:49 PM, James Henstridge <james at jamesh.id.au> wrote:
> On Wed, Apr 8, 2009 at 12:42 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>> John Arbash Meinel writes:
>>
>> > The current string definition is:
>> > typedef struct {
>> >     Py_ssize_t ob_refcnt;     /* 32/64-bit counter */
>> >     PyTypeObject *ob_type;    /* 32/64-bit pointer */
>> >     Py_ssize_t ob_size;       /* 32/64-bit counter */
>> >     long ob_shash;            /* 32/64-bit hash */
>> >     int ob_sstate;            /* 32-bit state */
>> >     char ob_sval[1];          /* 8-bit space for the NUL terminator */
>> > } PyStringObject;
>> >
>> > ob_sstate is taking up 4-bytes just to store one of the values 0, 1, 2,
>> > to indicate the INTERN state of the string. Which is 3-bytes of direct
>> > waste.
>> >
>> > Also, at least with my compiler, the 1 byte for ob_sval[] actually
>> > causes the sizeof(PyStringObject) to get 4 more bytes. Which means that
>> > every malloc() is over-allocating *another* 3 bytes.
>>
>> I don't think so. This looks like the usual C idiom for a variable-
>> sized data area, which works because C doesn't use the "1" for
>> anything except determining the size of a statically-allocated
>> PyStringObject. So with the exception of the null string (of which
>> there will be exactly one in your table), you're going to need to
>> allocate that space anyway, and you can't move it, it has to be the
>> last member of the struct. If the compiler wastes space by allocating
>> to alignment boundaries, it wastes it, but in general you can't do
>> anything about it. Or am I missing something?
>
> So on my system (64-bit Linux, Python 2.5), sizeof(PyStringObject) is
> 40. The actual data in the structure as defined is 37 bytes
> (8+8+8+8+4+1), followed by 3 pad bytes.
>
> When allocating string objects, it allocates "sizeof(PyStringObject) +
> size" bytes. So allocating a 3 byte string will ask malloc for 43
> bytes, even though only 40 bytes are needed (this includes a byte for
> the null termination). So it always wastes space even when padding
> isn't needed.
>
> This particular problem could be avoided by allocating
> "offsetof(PyStringObject, ob_sval) + 1 + size" bytes instead.
... and it appears that they've made this very fix for Python 2.7 and 3.1.
James.