Rethinking intern() for python
James Henstridge
james at jamesh.id.au
Wed Apr 8 10:49:43 BST 2009
On Wed, Apr 8, 2009 at 12:42 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> John Arbash Meinel writes:
>
> > The current string definition is:
> > typedef struct {
> > Py_ssize_t ob_refcnt; 32/64bit counter
> > PyTypeObject *ob_type; 32/64bit pointer
> > Py_ssize_t ob_size; 32/64bit counter
> > long ob_shash; 32/64bit hash
> > int ob_sstate; 32bit state
> > char ob_sval[1]; 8bit space for NULL
> > } PyStringObject;
> >
> > ob_sstate is taking up 4-bytes just to store one of the values 0, 1, 2,
> > to indicate the INTERN state of the string. Which is 3-bytes of direct
> > waste.
> >
> > Also, at least with my compiler, the 1 byte for ob_sval[] actually
> > causes the sizeof(PyStringObject) to get 4 more bytes. Which means that
> > every malloc() is over-allocating *another* 3 bytes.
>
> I don't think so. This looks like the usual C idiom for a variable-
> sized data area, which works because C doesn't use the "1" for
> anything except determining the size of a statically-allocated
> PyStringObject. So with the exception of the null string (of which
> there will be exactly one in your table), you're going to need to
> allocate that space anyway, and you can't move it, it has to be the
> last member of the struct. If the compiler wastes space by allocating
> to alignment boundaries, it wastes it, but in general you can't do
> anything about it. Or am I missing something?
So on my system (64-bit Linux, Python 2.5), sizeof(PyStringObject) is
40. The members as defined occupy 37 bytes (four 8-byte fields, a
4-byte int, and 1 byte for ob_sval), followed by 3 bytes of padding.
When allocating a string object, Python asks for "sizeof(PyStringObject) +
size" bytes. So allocating a 3 byte string will ask malloc for 43
bytes, even though only 40 bytes are needed (this includes a byte for
the null termination). The padding is counted on top of the string
data instead of being reused for it, so every string allocation
requests 3 more bytes than it needs, whatever its length.
This particular problem could be avoided by allocating
"offsetof(PyStringObject, ob_sval) + 1 + size" bytes instead.
James.
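To make the arithmetic above concrete, here is a rough sketch of the two size calculations. StringObject is only a stand-in that mirrors the quoted struct (ptrdiff_t and void * are assumed substitutes for Py_ssize_t and PyTypeObject *), and the helper names are hypothetical. None of this is taken from the Python source.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t, ptrdiff_t */

/* Stand-in mirroring the quoted Python 2.5 layout; not the real header. */
typedef struct {
    ptrdiff_t ob_refcnt;   /* assumed substitute for Py_ssize_t */
    void     *ob_type;     /* assumed substitute for PyTypeObject * */
    ptrdiff_t ob_size;
    long      ob_shash;
    int       ob_sstate;
    char      ob_sval[1];  /* first byte of the variable-sized data */
} StringObject;

/* Size as described in the mail: whole struct, padding included, plus length. */
static size_t alloc_size_sizeof(size_t len)
{
    return sizeof(StringObject) + len;
}

/* Suggested alternative: start of the data, plus length, plus 1 terminator. */
static size_t alloc_size_offsetof(size_t len)
{
    return offsetof(StringObject, ob_sval) + 1 + len;
}

/* Bytes requested beyond what the string actually needs. */
static size_t wasted_bytes(size_t len)
{
    assert(alloc_size_offsetof(len) <= alloc_size_sizeof(len));
    return alloc_size_sizeof(len) - alloc_size_offsetof(len);
}
```

On a system like the one described above, alloc_size_sizeof(3) and alloc_size_offsetof(3) should come out as 43 and 40. The exact numbers depend on the platform's type sizes and alignment rules, but the difference is the same for every string length.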