[performance note] set() packing

John Arbash Meinel john at arbash-meinel.com
Tue Dec 11 14:28:30 GMT 2007



I just ran into something surprising, so I figured I would mention it here.

It seems that set()s in Python are intentionally kept a little sparse. (Since they
are hash tables, the implementation wants to avoid collisions.)

If you look at the raw data structure, every entry added to a set costs 8 bytes (a
long for the hash plus a PyObject *). That grows to 12 or 16 bytes on 64-bit
platforms, where the pointer is 8 bytes and (on LP64 systems such as 64-bit Linux
and OS X, though not Win64) so is the 'long'.

Anyway, on top of that, when a set with <50,000 entries has to grow, the new table
is sized so that the set is no more than 1/4 full; above 50,000 entries it is only
grown to about 1/2 full (1/2 empty for pessimists).

And set tables are also strictly powers of 2 in size. (They use a bit-mask rather
than a modulo, so the size has to be 2^X.)
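
In Python terms, the slot lookup works out to roughly this (the mask trick is only
equivalent to a modulo when the table size is a power of two):

  # How a hash is mapped to a table slot: a bit-mask stands in for the
  # modulo, which only works out when table_size is a power of two.
  def slot_for(key, table_size):
      return hash(key) & (table_size - 1)   # same result as hash(key) % table_size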

So if you have 8k nodes, it will allocate a 32k-entry table, which is 256KB at 8
bytes per entry. If you go to 9k nodes, it will allocate a 64k-entry table, or
512KB (0.5MB).
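
If you want to watch that growth happen, here is a quick sketch using
sys.getsizeof(). Note that getsizeof() only appeared in Python 2.6, and the exact
thresholds and growth factors have shifted in later CPython versions, so the
numbers will not line up exactly with the 2.5-era figures above:

  import sys

  # Add elements one at a time and report every time the underlying hash
  # table grows.  getsizeof() reports the set header plus its table, not
  # the elements themselves, so each jump is the table being reallocated
  # at a larger power-of-two size.
  s = set()
  last = sys.getsizeof(s)
  for i in range(100000):
      s.add(i)
      size = sys.getsizeof(s)
      if size != last:
          print("grew to %8d bytes at %6d entries (%.1f bytes/entry)"
                % (size, len(s), size / float(len(s))))
          last = size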

With some simple code on my part building up sets of different ancestries in
bzr.dev, I ended up with an average of 21 bytes per node. With a set for every
merge node and its parents (about 7k revisions), I ended up using 1.8GB of RAM.
If I turned all of those sets into tuples, the RAM consumption dropped to 267MB.

I'm guessing that tuples just keep a flat array of pointers, so their per-node
overhead is probably closer to 4 bytes per node, rather than the 16-32 bytes a set
reaches after expansion.
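
A quick side-by-side on the same data shows the difference (again with
sys.getsizeof(), which counts just the container overhead and not the objects both
of them point to):

  import sys

  # The same 10,000 objects stored once as a tuple and once as a set.  The
  # tuple is a flat array of pointers (4 or 8 bytes each depending on the
  # platform); the set carries the sparse hash table described above.
  items = list(range(10000))
  t = tuple(items)
  s = set(items)
  print("tuple: %8d bytes (%4.1f bytes/entry)"
        % (sys.getsizeof(t), sys.getsizeof(t) / float(len(items))))
  print("set:   %8d bytes (%4.1f bytes/entry)"
        % (sys.getsizeof(s), sys.getsizeof(s) / float(len(items))))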

It also means that if memory pressure is an issue, a single large set is a lot
better than multiple small ones.

Oh, and I found that doing:

  new = set([revision_id])
  for parent in parents:
    new.update(previous[parent])
  previous[revision_id] = new

was causing a bit of memory fragmentation. If I created a new object at the end
with:

  previous[revision_id] = set(new)
or
  previous[revision_id] = new.copy()
or
  previous[revision_id] = frozenset(new)

My memory dropped from 1.8GB (28 bytes per node) to 1.4GB (21 bytes per node).
I'm guessing that by forcibly reallocating, it was able to create a single
contiguous memory section, rather than having the PySetObject data allocated in
a separate location from the data table.
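
Put another way, the loop above ends up shaped roughly like this. The little
'graph' here is made up just so the sketch runs on its own; in the real code the
revision ids and parent lists come from the repository:

  # A toy ancestry: each revision is listed after its parents (topological
  # order).  This stands in for the real revision graph.
  graph = [('rev-1', []), ('rev-2', ['rev-1']),
           ('rev-3', ['rev-1']), ('rev-4', ['rev-2', 'rev-3'])]

  previous = {}
  for revision_id, parents in graph:
      new = set([revision_id])
      for parent in parents:
          new.update(previous[parent])
      # Copying into a fresh frozenset (set(new) or new.copy() work just as
      # well) forces one allocation at the final size, instead of keeping
      # whatever oversized table the working set grew into.
      previous[revision_id] = frozenset(new)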

Anyway, for what I was doing, I ended up just using tuples for the raw storage,
and putting them into sets when I needed the set functionality. It did slow the
operation down a bit, but not consuming 1.8GB of RAM meant my system didn't
start swapping either.
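
The tuple version is just the same loop with a tuple() at the end, converting back
into a set only where the set operations are actually needed. Again a made-up
sketch rather than the real bzrlib code:

  graph = [('rev-1', []), ('rev-2', ['rev-1']),
           ('rev-3', ['rev-1']), ('rev-4', ['rev-2', 'rev-3'])]

  previous = {}
  for revision_id, parents in graph:
      ancestry = set([revision_id])
      for parent in parents:
          ancestry.update(previous[parent])
      # Store the compact form: one pointer per entry, no sparse table.
      previous[revision_id] = tuple(ancestry)

  # Rebuild a temporary set only where set semantics are actually needed:
  common = set(previous['rev-2']).intersection(previous['rev-3'])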

John
=:->


