[MERGE][bug #300289] Avoid unnecessary work
John Arbash Meinel
john at arbash-meinel.com
Tue Nov 25 18:42:14 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John Arbash Meinel wrote:
> John Arbash Meinel wrote:
>> I haven't fully understood why, but when branching from a stacked branch
>> into a standalone branch, it turns out that this "set.difference_update"
>> was a significant portion of the overall time.
>
>> Adding this shortcut drops the time for "bzr branch stacked standalone"
>> from 3m9s down to 1m45s (approx 40% of the total time.)
>
>> My guess is that we are doing something like has_key() on the target
>> index (which is being built). (I see 203k calls to
>> BTreeBuilder.iter_entries().) Also, in a bit of testing that I did, it
>> seems that set.difference_update() *always* iterates the full target,
>> even when source is small.
>
>> For example "small_set.difference_update(large_set)" is relatively slow.
>> (Of course, calling it 172k times doesn't help.)
>
>> I'm still investigating more fixes for stacked branches, but this was a
>> small patch that made a big difference.
>
>> John
>> =:->
>
> I did end up trying to look into various set operations, and what is
> fast and what isn't. I thought the information might be generally
> interesting. For these results I have 3 objects "large" indicates that
> it has approx 25,000 entries, and "small" has 0.
>
> $ TIMEIT 'small_set.difference(large_set)'
> 0.191 usec
>
> $ TIMEIT 'small_set.difference(large_dict)'
> 0.21 usec
>
> $ TIMEIT 'small_set.difference(large_list)'
> 723 usec
>
> So set.difference() seems to be very fast if the other object supports
> lookups, otherwise it iterates everything. However, you have to be
> careful if you start with a large set:
>
> $ TIMEIT 'large_set.difference(small_set)'
> 6470 usec
>
>
> What else is odd is that the relationship seems to be reversed for
> "difference_update":
>
> $ TIMEIT 'large_set.difference_update(small_set)'
> 0.152 usec
>
> $ TIMEIT 'small_set.difference_update(large_set)'
> 1380 usec
>
> $ TIMEIT 'small_set.difference_update(large_list)'
> 706 usec
>
> $ TIMEIT 'small_set.difference_update(large_dict)'
> 1750 usec
>
> I presume that iterating lists is just faster than iterating sets, which
> is yet again faster than iterating a dict.
>
> The other function you can use is "symmetric_difference":
>
> $ TIMEIT 'large_set.symmetric_difference(small_set)'
> 6380 usec
>
> $ TIMEIT 'small_set.symmetric_difference(large_list)'
> 4950 usec
>
> $ TIMEIT 'small_set.symmetric_difference(large_set)'
> 3470 usec
>
> Basically, don't use symmetric_difference unless you really need its
> result, because it is always slow.
>
> Anyway, it seems like you could use this logic:
>
> if len(x) > len(y) or isinstance(y, list):
> x.difference_update(y)
> else: # when y is a dict or set
> x = x.difference(y)
Just to round out the tests, what about "intersection"?
$ TIMEIT 'large_set.intersection(small_set)'
0.196 usec
$ TIMEIT 'small_set.intersection(large_set)'
0.203 usec
$ TIMEIT 'small_set.intersection(large_dict)'
1910 usec # surprised me
$ TIMEIT 'small_set.intersection(large_list)'
872 usec
So it seems that set.intersection(set) is properly optimized, but not
set.intersection(dict).
Also, while set<=>set is faster, you don't want to create a set just for
this unless you will be using it for other things:
$ TIMEIT 'set(large_list)'
5110 usec
$ TIMEIT 'set(large_set)'
3470 usec
$ TIMEIT 'set(large_dict)'
4530 usec
frozenset has similar characteristics except for:
$ TIMEIT 'frozenset(large_frozenset)'
0.155 usec
And this is because "frozenset(large_frozenset) is large_frozenset".
Just like with tuples. Because we know it is immutable, we don't have to
do any copying.
What I don't really understand is why "set(large_set)" is so slow.
Couldn't they see it is the same type of object, and then just malloc &&
memcpy?
I guess dict(a_dict) is just as slow:
$ TIMEIT 'dict(large_dict)'
4820 usec
At least lists are faster:
$ TIMEIT 'list(large_list)'
356 usec
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkksRwYACgkQJdeBCYSNAAPCYQCdHJohu9o3NRDqoSu7qY47Uchc
iA0AoKmXswN6zHNqhM2aONew8WCC9sCZ
=xsUG
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list