[MERGE] Packs. Kthxbye.

Fri Oct 19 02:46:18 BST 2007

John Arbash Meinel wrote:
[...]
> 
> ^- I used to think the idiom
> request_groups.setdefault(index, []).append((key, value, references))
> would be faster.
> 
> I did some timing, though, and at least for integer keys, doing
> 
> import random
> 
> unique_vals = 100
> total_count = 100 * 1000
> data = [random.randint(0, unique_vals) for x in xrange(total_count)]
> 
> # Method 1
> d = {}
> for x in data:
>   if x not in d:
>     d[x] = []
>   d[x].append(x)
> 
> # Method 2
> d = {}
> for x in data:
>   d.setdefault(x, []).append(x)

If you change this to:

# Method 2 tweaked
d = {}
ds = d.setdefault
for x in data:
  ds(x, []).append(x)

I find that the Method 2 does better than Method 1 in this test.  (By about the
same margin that 1 was beating 2 before the tweak.)

[
> So it seems a hash lookup for integers is faster than allocating a temporary list.

Well, allocating a list *and* looking up an attribute.

Of course, you can apply the same trick to Method 1...

# Method 1 tweaked
# d holds 'append' bound methods.  To access the original list, do d[x].__self__
# rather than just d[x].
d = {}
for x in data:
  if x not in d:
    d[x] = [].append
  d[x](x)

Which is faster than my tweaked Method 2.

> I wonder if that holds true for strings, though.

It's easy to modify your script to test: just add a “data = map(str, data)”.

You may be interested to know that string objects cache their hash value.  So
repeatedly hashing a string is basically as cheap as hashing it once.

-Andrew.