[MERGE] Fix for poor bzr rm * performance in trunk

Wed May 6 06:50:10 BST 2009

Thanks for the quick review John!

Here's an updated patch with the following new changes in it:
* Existing unit test updated.
* The resulting test failure fixed.
* Updated the NEWS file.

I've also sent a copyright-assignment e-mail to mbp at canonical.com as
requested on http://bazaar-vcs.org/BzrDevelopment.

  Cheers //Johan

2009/5/5 John Arbash Meinel <john at arbash-meinel.com>:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Johan Walles wrote:
>> Here's a patch that fixes the poor bzr rm * performance in bzr trunk:
>> https://bugs.launchpad.net/bugs/180116
>>
>> The patch does two things:
>> * It calls osutils.minimum_path_selection() from within
>> workingtree_4.iter_changes() instead of using a copy of
>> osutils.minimum_path_selection() in there.
>> * It converts osutils.minimum_path_selection() into an O(n log n) algorithm.
>>
>> Timings of bzr trunk without this patch (notice how it's slower than
>> 1.13.1, not my fault though :-):
>> $ time rm --keep *
>> real    0m11.291s
>> user    0m7.304s
>> sys     0m3.880s
>>
>> New timings on my system with this patch applied:
>> $ time bzr rm *
>> real    0m11.124s
>> user    0m7.088s
>> sys     0m3.920s
>>
>> $ time bzr rm --keep *
>> real    0m9.936s
>> user    0m6.768s
>> sys     0m3.076s
>>
>> So bzr rm now performs about the same with and without --keep.  As a
>> side effect, anybody else calling osutils.minimum_path_selection() will
>> be faster as well.
>>
>>  Have fun :-) //Johan
>
>
> I'm happy to see this being worked on.
>
> +    search_paths = []
> +    sorted_paths = list(paths)
> +    sorted_paths.sort()
>
> ^- I think you can just do "sorted_paths = sorted(paths)"
>
> Which should be approximately the same, but is a bit more
> straightforward (and possibly optimizable by python, etc.)
>
>
> +    for path in sorted_paths:
> +        if len(search_paths) == 0:
> +            # Result is empty, add first path
> +            search_paths.append(path)
> +            continue
> +        if not is_inside(search_paths[-1], path):
> +            # This path is unique, add it
> +            search_paths.append(path)
> +            continue
> +    return set(search_paths)
>
> ^- So I believe the only question is whether this optimization is always
> legal. Once sorted, will a containing entry always be the directly
> previous entry added to search_paths....
>
> My quick answer is that it might depend on the sorting. Consider:
>
>  a
>  a/b
>  a-b
>  a0b
>
> if you sort this, you get:
>
>  a
>  a-b
>  a/b
>  a0b
>
> See the potential problem there? If I understand it correctly, 'a/b'
> won't be properly filtered from 'a', because 'a-b' will be the last
> added to the unique search paths.
>
> I think the correct fix is to do:
>
> def sort_key(path):
>  return path.split('/')
> sorted_paths = sorted(paths, key=sort_key)
>
> I think it is important for us to add some test cases for these
> edge-cases in bzrlib/tests/test_osutils.py
>
> BB:resubmit
>
> John
> =:->
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (Cygwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iEYEARECAAYFAkoASiIACgkQJdeBCYSNAAOw6wCfc/cj2vF1jdhxe9IKaEOUDVUF
> mq8An2K4K7KofLdfnFJRmwdD5goggmwg
> =mI6V
> -----END PGP SIGNATURE-----
>