[MERGE][1.0] Graph.heads() optimization... Now 1770x

John Arbash Meinel john at arbash-meinel.com
Fri Dec 7 20:12:42 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John Arbash Meinel wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> In trying to work on Graph.find_difference() I found some sub-optimal code for
>> Graph.heads(). The attached patch speeds up a heads() call for me from 11.4s to
>> 7.48 milliseconds with a pack repository.
>>
> [...]
>> I've done some playing with Graph.find_difference() but I ran out of time, and
>> felt like this was more than worth merging.
> 
> This change looks good to me.  And I love the subject line :)
> 

...

>> +    def preload_parents(self, revision_ids):
>> +        """Cache the parents of these revisions, we will ask for them soon.
>> +
>> +        :return: parents that were not preloaded
>> +        """
>> +        return set(revisions).difference(revision_ids)
>> +
> 
> This method looks broken; isn't revisions an unbound variable here?

It probably was. But I realized I really agreed with Robert that preload wasn't
the right way to go about it. I wasn't very happy having to introduce a new api
when ParentsProviders were very ad-hoc. (Some repositories implemented it
themselves, some returned a helper object, None of the implementations
inherited from another implementation, etc.)

So I went ahead and change the functions so that step() actually takes a
dictionary of parents that it is going to care about. And made
"preload_parents" just a member of Graph. Which allows me to make it private,
and doesn't involve running into a bzr-svn implementation that isn't providing it.

The reason the original code couldn't was because _BreadthFirstSearcher was
defined as an iterator, and there was nowhere you could supply a parameter. But
making "step()" a separate function it was easy to do, I just needed to take
that step.

As a positive bonus, it drops the time for heads() down to 6.44ms, thus getting
another 200x performance. For knits it drops the time back down to 28ms which
is about 10% better than the preload() form.

> 
>>  class _StackedParentsProvider(object):
>>  
>> @@ -87,6 +94,75 @@
>>                  break
>>          return [found.get(r, None) for r in revision_ids]
>>  
>> +    def preload_parents(self, revision_ids):
>> +        """Cache the parents of these revisions, we will ask for them soon.
>> +
>> +        :return: parents that are actually present
>> +        """
> 
> You seem to vary in the description of the return value for preload_parents.
> What does “actually present” mean anyway, does it mean before or after
> preloading?

I changed my mind from returning what the function could cache, to what it
could not cache. With the idea that Stacked could then request more from
another source. However, it already had the right logic for "get_parents" so
why implement weird logic for preload... So I got rid of preload.


...

>> +    def preload_parents(self, revision_ids):
>> +        """See StackedParentsProvider.preload_parents"""
>> +        return set(revision_ids)
> 
> Why see StackedParentsProvider's doctring for this rather than any other
> implementation?  Also, why is this the only location that doesn't duplicate the
> docstring.

I had gotten tired of copy and paste. But as mentioned, its all gone now.


...

>> +    def test_graph_difference_extended_history(self):
>> +        graph = self.make_graph(extended_history_shortcut)
>> +        self.expectFailure('find_difference is confused by shortcuts',
>> +            self.assertEqual, (set(['e']), set(['f'])),
>> +            graph.find_difference('e', 'f'))
>> +        self.assertEqual((set(['e']), set(['f'])),
>> +                         graph.find_difference('e', 'f'))
>> +        self.assertEqual((set(['f']), set(['e'])),
>> +                         graph.find_difference('f', 'e'))
>> +
> 
> Thanks for adding this.  (Even though it's not directly related to the change in
> this patch.)

Of course, I was hoping to make it pass :)

> 
>> +
>> +
>> +class TestCachingParentsProvider(tests.TestCase):
>> +
...

>> +        self.assertEqual(['a', 'b', 'c'], inst_pp.calls)
> 
> Personally, I'd make three different test methods out of this method; one for
> caching of successful lookups, one for caching of negative lookups, and one for
> caching when asking for many parents, some of which are already cached.  You've
> already grouped the code in this method that way, you may as well take the next
> step and make your intent explicit :)

I think I split it up into 4 or so.

> 
>> +    def test_get_parents_repeated(self):
>> +        dict_pp = _mod_graph.DictParentsProvider({'a':('b',)})
>> +        inst_pp = InstrumentedParentsProvider(dict_pp)
>> +        caching_pp = _mod_graph.CachingParentsProvider(inst_pp)
> 
> You do this setup a lot.  Make a helper for it — or just put it setUp.

And moved this into setUp.


...

> 
> [One slightly odd thing about these tests is how you do "self.assertFoo(...,
> some_call())", but then follow it with an assertion that relies on the
> side-effect of some_call.  I think that's ok because these tests are generally
> pretty simple, it's just something I haven't seen in test code before.]
> 
> -Andrew.

So I was doing that to make sure that calling foo both returns the right value,
*and* has the correct side effects.

I could do it as:

value = call()
self.assertEqual(expected, value)
self.assertEqual(expected, side_effects)

I've just always done it inline, as I feel the temporary variable is wasted.
Especially since it might also do:

self.assertEqual(expected, call())
self.assertEqual(...sideeffect...)
self.assertEqual(expected, call()) # Same 'expected'

Where if you cache in a local variable, you might accidentally pass it to the
wrong variable in the second call, and the test succeeds for the wrong reason.

Though that is also a good reason to not chain tests with side-effects.

Anyway, the attached patch should address all of the review comments. Since
Martin is allowing this to be merged post 1.0rc2, I still have a chance to get
find_differences working for 1.0.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHWak6JdeBCYSNAAMRAlBTAJ9TU86MLc2jQ1gNAiVtrJOWOQtkKACfYRyS
hPliJVmItOiakA9IhNuHc5I=
=+M+T
-----END PGP SIGNATURE-----



More information about the bazaar mailing list