[MERGE][bug #300289] Avoid unnecessary work

John Arbash Meinel john at arbash-meinel.com
Mon Nov 24 22:31:35 GMT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> I haven't fully understood why, but when branching from a stacked branch
> into a standalone branch, it turns out that this "set.difference_update"
> was a significant portion of the overall time.
> 
> Adding this shortcut drops the time for "bzr branch stacked standalone"
> from 3m9s down to 1m45s (approx 40% of the total time.)
> 
> My guess is that we are doing something like has_key() on the target
> index (which is being built). (I see 203k calls to
> BTreeBuilder.iter_entries().) Also, in a bit of testing that I did, it
> seems that set.difference_update() *always* iterates the full target,
> even when source is small.
> 
> For example "small_set.difference_update(large_set)" is relatively slow.
> (Of course, calling it 172k times doesn't help.)
> 
> I'm still investigating more fixes for stacked branches, but this was a
> small patch that made a big difference.
> 
> John
> =:->

I did end up trying to look into various set operations, and what is
fast and what isn't. I thought the information might be generally
interesting. For these results I have 3 objects "large" indicates that
it has approx 25,000 entries, and "small" has 0.

$ TIMEIT 'small_set.difference(large_set)'
 0.191 usec

$ TIMEIT 'small_set.difference(large_dict)'
 0.21 usec

$ TIMEIT 'small_set.difference(large_list)'
 723 usec

So set.difference() seems to be very fast if the other object supports
lookups, otherwise it iterates everything. However, you have to be
careful if you start with a large set:

$ TIMEIT 'large_set.difference(small_set)'
 6470 usec


What else is odd is that the relationship seems to be reversed for
"difference_update":

$ TIMEIT 'large_set.difference_update(small_set)'
 0.152 usec

$ TIMEIT 'small_set.difference_update(large_set)'
 1380 usec

$ TIMEIT 'small_set.difference_update(large_list)'
 706 usec

$ TIMEIT 'small_set.difference_update(large_dict)'
 1750 usec

I presume that iterating lists is just faster than iterating sets, which
is yet again faster than iterating a dict.

The other function you can use is "symmetric_difference":

$ TIMEIT 'large_set.symmetric_difference(small_set)'
  6380 usec

$ TIMEIT 'small_set.symmetric_difference(large_list)'
  4950 usec

$ TIMEIT 'small_set.symmetric_difference(large_set)'
  3470 usec

Basically, don't use symmetric_difference unless you really need its
result, because it is always slow.

Anyway, it seems like you could use this logic:

if len(x) > len(y) or isinstance(y, list):
  x.difference_update(y)
else: # when y is a dict or set
  x = x.difference(y)


I suppose it would be nice if python internals took care of this sort of
thing, but for now, it is something to be aware of.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkrK0YACgkQJdeBCYSNAAMewgCggvtrq0kX2v/G5jEOclJEPCrx
MEIAnjqhGSuQzoFg4Ye6cP7/F0Sn+A6q
=KH9v
-----END PGP SIGNATURE-----



More information about the bazaar mailing list