[MERGE][bug #300289] Avoid unnecessary work

John Arbash Meinel john at arbash-meinel.com
Tue Nov 25 18:42:14 GMT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> John Arbash Meinel wrote:
>> I haven't fully understood why, but when branching from a stacked branch
>> into a standalone branch, it turns out that this "set.difference_update"
>> was a significant portion of the overall time.
> 
>> Adding this shortcut drops the time for "bzr branch stacked standalone"
>> from 3m9s down to 1m45s (approx 40% of the total time.)
> 
>> My guess is that we are doing something like has_key() on the target
>> index (which is being built). (I see 203k calls to
>> BTreeBuilder.iter_entries().) Also, in a bit of testing that I did, it
>> seems that set.difference_update() *always* iterates the full target,
>> even when source is small.
> 
>> For example "small_set.difference_update(large_set)" is relatively slow.
>> (Of course, calling it 172k times doesn't help.)
> 
>> I'm still investigating more fixes for stacked branches, but this was a
>> small patch that made a big difference.
> 
>> John
>> =:->
> 
> I did end up trying to look into various set operations, and what is
> fast and what isn't. I thought the information might be generally
> interesting. For these results I have 3 objects "large" indicates that
> it has approx 25,000 entries, and "small" has 0.
> 
> $ TIMEIT 'small_set.difference(large_set)'
>  0.191 usec
> 
> $ TIMEIT 'small_set.difference(large_dict)'
>  0.21 usec
> 
> $ TIMEIT 'small_set.difference(large_list)'
>  723 usec
> 
> So set.difference() seems to be very fast if the other object supports
> lookups, otherwise it iterates everything. However, you have to be
> careful if you start with a large set:
> 
> $ TIMEIT 'large_set.difference(small_set)'
>  6470 usec
> 
> 
> What else is odd is that the relationship seems to be reversed for
> "difference_update":
> 
> $ TIMEIT 'large_set.difference_update(small_set)'
>  0.152 usec
> 
> $ TIMEIT 'small_set.difference_update(large_set)'
>  1380 usec
> 
> $ TIMEIT 'small_set.difference_update(large_list)'
>  706 usec
> 
> $ TIMEIT 'small_set.difference_update(large_dict)'
>  1750 usec
> 
> I presume that iterating lists is just faster than iterating sets, which
> is yet again faster than iterating a dict.
> 
> The other function you can use is "symmetric_difference":
> 
> $ TIMEIT 'large_set.symmetric_difference(small_set)'
>   6380 usec
> 
> $ TIMEIT 'small_set.symmetric_difference(large_list)'
>   4950 usec
> 
> $ TIMEIT 'small_set.symmetric_difference(large_set)'
>   3470 usec
> 
> Basically, don't use symmetric_difference unless you really need its
> result, because it is always slow.
> 
> Anyway, it seems like you could use this logic:
> 
> if len(x) > len(y) or isinstance(y, list):
>   x.difference_update(y)
> else: # when y is a dict or set
>   x = x.difference(y)

Just to round out the tests, what about "intersection"?


$ TIMEIT 'large_set.intersection(small_set)'
  0.196 usec

$ TIMEIT 'small_set.intersection(large_set)'
  0.203 usec

$ TIMEIT 'small_set.intersection(large_dict)'
  1910 usec  # surprised me

$ TIMEIT 'small_set.intersection(large_list)'
  872 usec

So it seems that set.intersection(set) is properly optimized, but not
set.intersection(dict).

Also, while set<=>set is faster, you don't want to create a set just for
this unless you will be using it for other things:

$ TIMEIT 'set(large_list)'
  5110 usec

$ TIMEIT 'set(large_set)'
  3470 usec

$ TIMEIT 'set(large_dict)'
  4530 usec

frozenset has similar characteristics except for:

$ TIMEIT 'frozenset(large_frozenset)'
  0.155 usec

And this is because "frozenset(large_frozenset) is large_frozenset".
Just like with tuples. Because we know it is immutable, we don't have to
do any copying.

What I don't really understand is why "set(large_set)" is so slow.
Couldn't they see it is the same type of object, and then just malloc &&
memcpy?

I guess dict(a_dict) is just as slow:

$ TIMEIT 'dict(large_dict)'
  4820 usec

At least lists are faster:

$ TIMEIT 'list(large_list)'
  356 usec

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkksRwYACgkQJdeBCYSNAAPCYQCdHJohu9o3NRDqoSu7qY47Uchc
iA0AoKmXswN6zHNqhM2aONew8WCC9sCZ
=xsUG
-----END PGP SIGNATURE-----



More information about the bazaar mailing list