[MERGE][bug #300289] Avoid unnecessary work
John Arbash Meinel
john at arbash-meinel.com
Mon Nov 24 22:31:35 GMT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John Arbash Meinel wrote:
> I haven't fully understood why, but when branching from a stacked branch
> into a standalone branch, it turns out that this "set.difference_update"
> was a significant portion of the overall time.
>
> Adding this shortcut drops the time for "bzr branch stacked standalone"
> from 3m9s down to 1m45s (approx 40% of the total time.)
>
> My guess is that we are doing something like has_key() on the target
> index (which is being built). (I see 203k calls to
> BTreeBuilder.iter_entries().) Also, in a bit of testing that I did, it
> seems that set.difference_update() *always* iterates the full target,
> even when source is small.
>
> For example "small_set.difference_update(large_set)" is relatively slow.
> (Of course, calling it 172k times doesn't help.)
>
> I'm still investigating more fixes for stacked branches, but this was a
> small patch that made a big difference.
>
> John
> =:->
I did end up trying to look into various set operations, and what is
fast and what isn't. I thought the information might be generally
interesting. For these results I have 3 objects "large" indicates that
it has approx 25,000 entries, and "small" has 0.
$ TIMEIT 'small_set.difference(large_set)'
0.191 usec
$ TIMEIT 'small_set.difference(large_dict)'
0.21 usec
$ TIMEIT 'small_set.difference(large_list)'
723 usec
So set.difference() seems to be very fast if the other object supports
lookups, otherwise it iterates everything. However, you have to be
careful if you start with a large set:
$ TIMEIT 'large_set.difference(small_set)'
6470 usec
What else is odd is that the relationship seems to be reversed for
"difference_update":
$ TIMEIT 'large_set.difference_update(small_set)'
0.152 usec
$ TIMEIT 'small_set.difference_update(large_set)'
1380 usec
$ TIMEIT 'small_set.difference_update(large_list)'
706 usec
$ TIMEIT 'small_set.difference_update(large_dict)'
1750 usec
I presume that iterating lists is just faster than iterating sets, which
is yet again faster than iterating a dict.
The other function you can use is "symmetric_difference":
$ TIMEIT 'large_set.symmetric_difference(small_set)'
6380 usec
$ TIMEIT 'small_set.symmetric_difference(large_list)'
4950 usec
$ TIMEIT 'small_set.symmetric_difference(large_set)'
3470 usec
Basically, don't use symmetric_difference unless you really need its
result, because it is always slow.
Anyway, it seems like you could use this logic:
if len(x) > len(y) or isinstance(y, list):
x.difference_update(y)
else: # when y is a dict or set
x = x.difference(y)
I suppose it would be nice if python internals took care of this sort of
thing, but for now, it is something to be aware of.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkkrK0YACgkQJdeBCYSNAAMewgCggvtrq0kX2v/G5jEOclJEPCrx
MEIAnjqhGSuQzoFg4Ye6cP7/F0Sn+A6q
=KH9v
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list