[MERGE] Move RemoteRepository caching logic to CachedParentsProvider

Andrew Bennetts andrew.bennetts at canonical.com
Wed Dec 10 04:06:30 GMT 2008


Aaron Bentley wrote:
> Andrew Bennetts wrote:
> > I think this is because it unfortunately isn't safe to cache misses in
> > RemoteRepository, because the cache of misses doesn't get invalidated when
> > revisions are added to its _real_repository.
> 
> I have made miss-caching optional, and disabled it for repository caches.
> 
> All tests pass with these changes.

bb:tweak

Looks good, although I spotted a bug.  And I'm feeling a bit fussy about
tests today... See comments inline.

[...]
> === modified file 'bzrlib/graph.py'
> --- bzrlib/graph.py	2008-10-30 15:20:00 +0000
> +++ bzrlib/graph.py	2008-12-03 05:09:44 +0000
> @@ -99,42 +99,84 @@
>  
>  
>  class CachingParentsProvider(object):
[...]
> +        :param parent_provider: The ParentProvider to use.  It or
> +            get_parent_map must be supplied.
> +        :param get_parent_map: The get_parent_map callback to use.  It or
> +            parent_provider must be supplied.

I've occasionally had APIs like this, i.e. where exactly one of a set of
params must be passed.  Most recently in bzrlib.smart.client, IIRC.  It'd be
nice if Python allowed us to write:

    assert_exactly_one_is_not_none(parent_provider, get_parent_map)

Well, writing that isn't hard — but producing a helpful exception message
when it fails is :)

I guess this version is feasible, and only a little bit ugly:

    assert_exactly_one_is_not_none(
        locals(), 'parent_provider', 'get_parent_map')

Anyway, there's nothing about that that I think you should do to land this
branch, I'm just thinking out loud.
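For the record, a rough sketch of such a helper (entirely hypothetical, not something I'm asking you to add to this branch) might look like:

```python
def assert_exactly_one_is_not_none(kwargs, *names):
    """Check that exactly one of the named values in kwargs is not None.

    kwargs is typically the caller's locals(); names are the parameter
    names to check.  Raises AssertionError with a message naming the
    offending parameters.
    """
    supplied = [name for name in names if kwargs.get(name) is not None]
    if len(supplied) != 1:
        raise AssertionError(
            'Exactly one of (%s) must be supplied, got: %s'
            % (', '.join(names), ', '.join(supplied) or 'none'))
```

So a constructor could just do assert_exactly_one_is_not_none(locals(), 'parent_provider', 'get_parent_map') and get a readable failure message for free.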

> +        :param debug: If true, mutter debugging messages.
> +        """
>          self._real_provider = parent_provider
> -        # Theoretically we could use an LRUCache here
> -        self._cache = {}
[...]
> +        self._parents_map = {}

I think I preferred the old variable name.  It's not a big deal though, so
if you prefer it this way I don't mind.  The comment about using an
LRUCache still seems applicable, though.

> +        self._cache_misses = True
> +        self._debug = debug
> +        if self._debug:
> +            self._requested_parents = None
>  
>      def __repr__(self):
>          return "%s(%r)" % (self.__class__.__name__, self._real_provider)
>  
> +    def enable_cache(self, cache_misses=True):
> +        """Enable cache."""
> +        self._parents_map = {}
> +        self._cache_misses = cache_misses
> +        if self._debug:
> +            self._requested_parents = set()

enable_cache not only enables caching, it also clears an already enabled
cache.  I think that's a bit dangerous.  How about making it raise
AssertionError if self._parents_map is not None?

If caching is optional, then the constructor's docstring should explicitly
state whether caching is initially enabled or disabled.
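Concretely, I mean something like this (a stripped-down sketch only, assuming for illustration that caching starts disabled, which is exactly the sort of thing the docstring should pin down):

```python
class CachingParentsProvider(object):
    """Sketch showing just the guard I'm suggesting.

    The real class has more state and methods; this only models the
    enable/disable lifecycle of the parents cache.
    """

    def __init__(self, debug=False):
        self._parents_map = None  # caching initially disabled (assumption)
        self._cache_misses = True
        self._debug = debug

    def enable_cache(self, cache_misses=True):
        """Enable the cache.  Raises AssertionError if already enabled."""
        if self._parents_map is not None:
            raise AssertionError('Cache enabled when already enabled.')
        self._parents_map = {}
        self._cache_misses = cache_misses

    def disable_cache(self):
        """Disable the cache, discarding any cached entries."""
        self._parents_map = None
```

That way silently dropping a live cache becomes impossible; a caller that really wants a fresh cache has to say so by calling disable_cache first.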

> +    def disable_cache(self):
> +        """Disable cache."""
> +        self._parents_map = None
> +        if self._debug:
> +            self._requested_parents = None

This method also clears the cache.  I think that's reasonable behaviour, but
you should mention it in the docstring.

>      def get_parent_map(self, keys):
> -        """See _StackedParentsProvider.get_parent_map"""
[...]
> +        """See RemoteRepository.get_parent_map."""

It's not a big deal, but I think this docstring should refer to something
other than RemoteRepository, probably _StackedParentsProvider.  If only we
had a ParentProvider abstract base class/interface class, so there was a
clearly canonical place to document get_parent_map.  Oh well.
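i.e. something along these lines (purely hypothetical, the name and wording are just my guesses at the contract):

```python
class ParentsProvider(object):
    """Hypothetical interface class: the one canonical place to document
    the get_parent_map contract shared by all parents providers."""

    def get_parent_map(self, keys):
        """Return a dict mapping each key present in this provider to a
        tuple of its parent keys.  Keys that are absent are simply
        omitted from the result."""
        raise NotImplementedError(self.get_parent_map)
```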

> +        # Hack to build up the caching logic.
> +        ancestry = self._parents_map
> +        if ancestry is None:
> +            # Caching is disabled.
> +            missing_revisions = set(keys)
> +            ancestry = {}
> +        else:
> +            missing_revisions = set(key for key in keys if key not in ancestry)
> +        if missing_revisions:
> +            parent_map = self._get_parent_map(missing_revisions)
> +            if self._debug:
> +                mutter('re-retrieved revisions: %d of %d',
> +                        len(set(ancestry).intersection(parent_map)),
> +                        len(parent_map))
> +            ancestry.update(parent_map)
> +            if self._cache_misses:
> +                ancestry.update(dict((k, None) for k in missing_revisions
> +                                     if k not in parent_map))

Add a comment here to say that None is never a valid parents list, so None
is a safe value to use to record a miss.
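The point being that real parents entries are always sequences (even an initial revision maps to ('null:',), never None), so None can't collide with a genuine value.  A standalone illustration, with made-up revision ids:

```python
cache = {}
# Hypothetical result from the underlying provider: 'ghost' is absent.
parent_map = {'rev1': ('null:',)}
requested = set(['rev1', 'ghost'])
cache.update(parent_map)
# Record the misses as None; a real parents list is never None.
cache.update(dict((key, None) for key in requested
                  if key not in parent_map))
# Later lookups can then distinguish a cached miss (None) from a cache
# hit (a tuple) without another round trip to the provider.
hits = dict((key, value) for key, value in cache.items()
            if value is not None)
```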

> === modified file 'bzrlib/remote.py'
> --- bzrlib/remote.py	2008-12-01 23:50:52 +0000
> +++ bzrlib/remote.py	2008-12-03 05:09:44 +0000
> @@ -287,17 +287,6 @@
[...]
> @@ -481,7 +470,8 @@
>              other_repository.bzrdir.transport.base !=
>              self.bzrdir.transport.base):
>              parents_provider = graph._StackedParentsProvider(
> -                [parents_provider, other_repository._make_parents_provider()])
> +                [self._unstacked_provider,
> +                 other_repository._make_parents_provider()])
>          return graph.Graph(parents_provider)

Is this change correct?  It appears that if other_repository is passed to a
RemoteRepository's get_graph, then the RemoteRepository's fallback
repositories won't be used.  Is there a test for this case?  It looks like
bzrlib.tests.per_repository.test_add_fallback_repository exercises
get_parent_map but not get_graph().get_parent_map.  In short, if I extend
test_add_fallback_repository to cover this case, it fails:

=== modified file 'bzrlib/tests/per_repository/test_add_fallback_repository.py'
--- bzrlib/tests/per_repository/test_add_fallback_repository.py	2008-09-04 20:32:04 +0000
+++ bzrlib/tests/per_repository/test_add_fallback_repository.py	2008-12-10 02:31:19 +0000
@@ -16,7 +16,11 @@
 
 """Tests for Repository.add_fallback_repository."""
 
-from bzrlib import errors
+from bzrlib import (
+    bzrdir,
+    errors,
+    remote,
+    )
 from bzrlib.revision import NULL_REVISION
 from bzrlib.tests import TestNotApplicable
 from bzrlib.tests.per_repository import TestCaseWithRepository
@@ -25,7 +29,14 @@
 class TestAddFallbackRepository(TestCaseWithRepository):
 
     def test_add_fallback_repository(self):
-        repo = self.make_repository('repo')
+        if isinstance(self.repository_format, remote.RemoteRepositoryFormat):
+            # RemoteRepository by default builds a default format real
+            # repository, but the default format is unstackable.  So explicitly
+            # make a stackable real repository and use that.
+            repo = self.make_repository('repo', format='1.9')
+            repo = bzrdir.BzrDir.open(self.get_url('repo')).open_repository()
+        else:
+            repo = self.make_repository('repo')
         tree = self.make_branch_and_tree('branch')
         if not repo._format.supports_external_lookups:
             self.assertRaises(errors.UnstackableRepositoryFormat,
@@ -45,3 +56,10 @@
         # ... or on the repository directly...
         self.assertEqual({revision_id: (NULL_REVISION,)},
             repo.get_parent_map([revision_id]))
+        # ... or on the repository's graph.
+        self.assertEqual({revision_id: (NULL_REVISION,)},
+            repo.get_graph().get_parent_map([revision_id]))
+        # ... or on the repository's graph, when there is an other repository.
+        other = self.make_repository('other')
+        self.assertEqual({revision_id: (NULL_REVISION,)},
+            repo.get_graph(other).get_parent_map([revision_id]))

(As you can see, this patch also fixes this test so that it actually applies
to RemoteRepository!  And ideally I think there should be multiple test methods
rather than piggy-backing lots of tests on the end of one method, but I
don't think that's worth stalling this branch for.)

AFAICT, RemoteRepository.get_graph should just be identical to
Repository.get_graph.

> === modified file 'bzrlib/tests/test_graph.py'
> --- bzrlib/tests/test_graph.py	2008-10-30 15:20:00 +0000
> +++ bzrlib/tests/test_graph.py	2008-12-03 05:09:44 +0000
> @@ -1379,13 +1379,13 @@
>  
>      def test_get_parent_map(self):
>          """Requesting the same revision should be returned from cache"""
> -        self.assertEqual({}, self.caching_pp._cache)
> +        self.assertEqual({}, self.caching_pp.get_cached_map())
>          self.assertEqual({'a':('b',)}, self.caching_pp.get_parent_map(['a']))
>          self.assertEqual(['a'], self.inst_pp.calls)
>          self.assertEqual({'a':('b',)}, self.caching_pp.get_parent_map(['a']))
>          # No new call, as it should have been returned from the cache
>          self.assertEqual(['a'], self.inst_pp.calls)
> -        self.assertEqual({'a':('b',)}, self.caching_pp._cache)
> +        self.assertEqual({'a':('b',)}, self.caching_pp.get_cached_map())
>  
>      def test_get_parent_map_not_present(self):
>          """The cache should also track when a revision doesn't exist"""
> @@ -1394,7 +1394,7 @@
>          self.assertEqual({}, self.caching_pp.get_parent_map(['b']))
>          # No new calls
>          self.assertEqual(['b'], self.inst_pp.calls)
> -        self.assertEqual({'b':None}, self.caching_pp._cache)
> +        self.assertEqual({'b':None}, self.caching_pp._parents_map)

Why aren't these tests consistent about whether they use get_cached_map() or
_parents_map?  I think they should be consistent, probably in favour of
get_cached_map().

> +class TestCachingParentsProviderExtras(tests.TestCaseWithTransport):

“Extras”?  What are “Extras” here?  A docstring is called for.

Reading ahead, it looks like these are tests for how CachingParentsProvider
behaves when one of its parents providers returns unrelated results.  The
test case name is fine, but an English description in a docstring is needed.
My first thought was “extra behaviours not tested elsewhere”.

> +    def test_cached(self):
> +        self.assertEqual({}, self.caching_pp.get_cached_map())
> +        self.assertEqual({'rev1': []},
> +                         self.caching_pp.get_parent_map(['rev1']))
> +        self.assertEqual(['rev1'], self.inst_pp.calls)
> +        self.assertEqual({'rev1': [], 'rev2': ['rev1']},
> +                         self.caching_pp.get_cached_map())
> +        self.assertEqual({'rev1': []},
> +                          self.caching_pp.get_parent_map(['rev1']))
> +        self.assertEqual(['rev1'], self.inst_pp.calls)
> +        self.caching_pp.disable_cache()
> +        self.assertIs(None, self.caching_pp.get_cached_map())

This appears to be a couple of unit tests in a single method.  The first
line is an assertion, and is repeated as the first line of another test.
Make a new test method for it and delete it from test_cached and
test_cache_misses.

    def test_cache_initially_empty(self):
        self.assertEqual({}, self.caching_pp.get_cached_map())

Similarly, there's a test about the behaviour of disable_cache() clearing
the cache tacked on the end.  Make an explicit test for that, too.

    def test_disable_cache_empties_cached_map(self):
        # Put something in the cache
        self.caching_pp.get_parent_map(['rev1'])
        self.assertEqual(2, len(self.caching_pp.get_cached_map()))
        # Clear the cache with disable_cache
        self.caching_pp.disable_cache()
        self.assertIs(None, self.caching_pp.get_cached_map())

It's a bit nit-picky, but I think it makes test_cache much clearer if you
keep its focus very narrow.  And it discourages bad habits when the next
person goes to add to these tests.

> === modified file 'bzrlib/tests/test_remote.py'
> --- bzrlib/tests/test_remote.py	2008-11-25 17:31:02 +0000
> +++ bzrlib/tests/test_remote.py	2008-12-03 05:09:44 +0000
> @@ -1236,6 +1236,26 @@
>              repo.get_parent_map, ['a-revision-id'])
>  
>  
> +class TestGetParentMapAllowsNew(tests.TestCaseWithTransport):
> +
> +    def test_allows_new_revisions(self):
> +        """get_parent_map's results can be updated by commit."""
> +        smart_server = server.SmartTCPServer_for_testing()
> +        smart_server.setUp()
> +        self.addCleanup(smart_server.tearDown)
> +        self.make_branch('branch')
> +        branch = Branch.open(smart_server.get_url() + '/branch')
> +        tree = branch.create_checkout('tree', lightweight=True)
> +        tree.lock_write()
> +        self.addCleanup(tree.unlock)
> +        graph = tree.branch.repository.get_graph()
> +        # This provides an opportunity for the missing rev-id to be cached.
> +        self.assertEqual({}, graph.get_parent_map(['rev1']))
> +        tree.commit('message', rev_id='rev1')
> +        graph = tree.branch.repository.get_graph()
> +        self.assertEqual({'rev1': ('null:',)}, graph.get_parent_map(['rev1']))

I like this test.  Nothing needs to change here :)

-Andrew.



