[MERGE][1.0] Graph.heads() optimization... 1500 times faster

Fri Dec 7 07:19:06 GMT 2007

John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> In trying to work on Graph.find_difference() I found some sub-optimal code for
> Graph.heads(). The attached patch speeds up a heads() call for me from 11.4s to
> 7.48 milliseconds with a pack repository.
> 
[...]
> I've done some playing with Graph.find_difference() but I ran out of time, and
> felt like this was more than worth merging.

This change looks good to me.  And I love the subject line :)

bb:tweak

(Although it's almost bb approve, I'm almost just nitpicking)

> # Begin patch
> === modified file 'bzrlib/graph.py'
> --- bzrlib/graph.py	2007-12-04 00:34:34 +0000
> +++ bzrlib/graph.py	2007-12-06 21:51:33 +0000
> @@ -55,6 +55,13 @@
>      def get_parents(self, revisions):
>          return [self.ancestry.get(r, None) for r in revisions]
>  
> +    def preload_parents(self, revision_ids):
> +        """Cache the parents of these revisions, we will ask for them soon.
> +
> +        :return: parents that were not preloaded
> +        """
> +        return set(revisions).difference(revision_ids)
> +

This method looks broken; isn't revisions an unbound variable here?

>  class _StackedParentsProvider(object):
>  
> @@ -87,6 +94,75 @@
>                  break
>          return [found.get(r, None) for r in revision_ids]
>  
> +    def preload_parents(self, revision_ids):
> +        """Cache the parents of these revisions, we will ask for them soon.
> +
> +        :return: parents that are actually present
> +        """

You seem to vary in the description of the return value for preload_parents.
What does “actually present” mean anyway, does it mean before or after
preloading?

> +        to_preload = set(revision_ids)
> +        for parent_provider in self._parent_providers:
> +            to_preload = parent_provider.preload_parents(to_preload)
> +            if not to_preload:
> +                break
> +        return to_preload

A comment, or possibly note in the docstring, about what preloading means on a
_StackedParentsProvider would be good I think.  I'm not sure this behaviour is
obvious.

[...]
> @@ -106,6 +182,12 @@
>              conforming to the behavior of StackedParentsProvider.get_parents
>          """
>          self.get_parents = parents_provider.get_parents
> +        #self.preload_parents = parents_provider.preload_parents

No need to keep commented out code lying around.

> === modified file 'bzrlib/index.py'
> --- bzrlib/index.py	2007-11-28 00:59:30 +0000
> +++ bzrlib/index.py	2007-12-06 20:40:25 +0000
> @@ -1024,6 +1024,10 @@
>                  result.append(None)
>          return result
>  
> +    def preload_parents(self, revision_ids):
> +        """See StackedParentsProvider.preload_parents"""
> +        return set(revision_ids)

Why see StackedParentsProvider's doctring for this rather than any other
implementation?  Also, why is this the only location that doesn't duplicate the
docstring.

>  class TestGraph(TestCaseWithMemoryTransport):
>  
> @@ -296,6 +320,16 @@
>          self.assertEqual((set([]), set(['rev3b', 'rev2b'])),
>                           graph.find_difference('rev2a', 'rev3b'))
>  
> +    def test_graph_difference_extended_history(self):
> +        graph = self.make_graph(extended_history_shortcut)
> +        self.expectFailure('find_difference is confused by shortcuts',
> +            self.assertEqual, (set(['e']), set(['f'])),
> +            graph.find_difference('e', 'f'))
> +        self.assertEqual((set(['e']), set(['f'])),
> +                         graph.find_difference('e', 'f'))
> +        self.assertEqual((set(['f']), set(['e'])),
> +                         graph.find_difference('f', 'e'))
> +

Thanks for adding this.  (Even though it's not directly related to the change in
this patch.)

> +
> +
> +class TestCachingParentsProvider(tests.TestCase):
> +
> +    def test_get_parents(self):
> +        dict_pp = _mod_graph.DictParentsProvider({'a':('b',)})
> +        inst_pp = InstrumentedParentsProvider(dict_pp)
> +        caching_pp = _mod_graph.CachingParentsProvider(inst_pp)
> +
> +        self.assertEqual({}, caching_pp._cache)
> +        self.assertEqual([('b',)], caching_pp.get_parents(['a']))
> +        self.assertEqual(['a'], inst_pp.calls)
> +        self.assertEqual([('b',)], caching_pp.get_parents(['a']))
> +        # No new call, as it should have been returned from the cache
> +        self.assertEqual(['a'], inst_pp.calls)
> +        self.assertEqual({'a':('b',)}, caching_pp._cache)
> +
> +        self.assertEqual([None], caching_pp.get_parents(['b']))
> +        self.assertEqual(['a', 'b'], inst_pp.calls)
> +        self.assertEqual([None], caching_pp.get_parents(['b']))
> +        self.assertEqual(['a', 'b'], inst_pp.calls)
> +        self.assertEqual({'a':('b',), 'b':None}, caching_pp._cache)
> +
> +        self.assertEqual([('b',), None, None, ('b',)],
> +                         caching_pp.get_parents(['a', 'b', 'c', 'a']))
> +        # Only 'c' should be requested anew
> +        self.assertEqual(['a', 'b', 'c'], inst_pp.calls)

Personally, I'd make three different test methods out of this method; one for
caching of successful lookups, one for caching of negative lookups, and one for
caching when asking for many parents, some of which are already cached.  You've
already grouped the code in this method that way, you may as well take the next
step and make your intent explicit :)

> +    def test_get_parents_repeated(self):
> +        dict_pp = _mod_graph.DictParentsProvider({'a':('b',)})
> +        inst_pp = InstrumentedParentsProvider(dict_pp)
> +        caching_pp = _mod_graph.CachingParentsProvider(inst_pp)

You do this setup a lot.  Make a helper for it — or just put it setUp.

> +    def test_preload_parents(self):
> +        dict_pp = _mod_graph.DictParentsProvider({'a':('b',)})
> +        inst_pp = InstrumentedParentsProvider(dict_pp)
> +        caching_pp = _mod_graph.CachingParentsProvider(inst_pp)
> +
> +        self.assertEqual({}, caching_pp._cache)
> +        self.assertEqual(set(), caching_pp.preload_parents(['a', 'b']))
> +        self.assertEqual(['a', 'b'], inst_pp.calls)
> +        self.assertEqual([('b',)], caching_pp.get_parents(['a']))
> +        self.assertEqual([None], caching_pp.get_parents(['b']))
> +        # No new call, as it should have been returned from the cache
> +        self.assertEqual(['a', 'b'], inst_pp.calls)
> +        self.assertEqual({'a':('b',), 'b':None}, caching_pp._cache)

[One slightly odd thing about these tests is how you do "self.assertFoo(...,
some_call())", but then follow it with an assertion that relies on the
side-effect of some_call.  I think that's ok because these tests are generally
pretty simple, it's just something I haven't seen in test code before.]

-Andrew.