[MERGE] Move RemoteRepository caching logic to CachedParentsProvider

Wed Dec 10 20:26:32 GMT 2008

Andrew Bennetts wrote:
> Aaron Bentley wrote:
>> Andrew Bennetts wrote:
>>> I think this is because it unfortunately isn't safe to cache misses in
>>> RemoteRepository, because the cache of misses doesn't get invalidated when
>>> revisions are added to its _real_repository.
>> I have made miss-caching optional, and disabled it for repository caches.
>>
>> All tests pass with these changes.
> 
> bb:tweak
> 
> Looks good, although I spotted a bug.  And I'm feeling a bit fussy about
> tests today... See comments inline.
> 
> [...]
>> +        self._parents_map = {}
> 
> I think I preferred the old variable name.  It's not a big deal though, so
> if you prefer it this way I don't mind.

Hey, it's your variable name.  Changed.

> The comment about using an
> LRUCache still seems applicable, though.

I'm not sure whether an LRUCache allows caching misses.  But that
comment's been there for ages, no one's changed it, so it feels like litter.

>> +        self._cache_misses = True
>> +        self._debug = debug
>> +        if self._debug:
>> +            self._requested_parents = None
>>  
>>      def __repr__(self):
>>          return "%s(%r)" % (self.__class__.__name__, self._real_provider)
>>  
>> +    def enable_cache(self, cache_misses=True):
>> +        """Enable cache."""
>> +        self._parents_map = {}
>> +        self._cache_misses = cache_misses
>> +        if self._debug:
>> +            self._requested_parents = set()
> 
> enable_cache not only enables caching, it also clears an already enabled
> cache.  I think that's a bit dangerous.

How?  It's a cache.  It shouldn't contain anything we can't get again.

> How about making it raise
> AssertionError if self._parents_map is not None?

I don't think that the current policy against assertions is sensible.
But since there is one, I don't think I should be raising
AssertionErrors, either.

> If caching is optional, then the constructor's docstring should explicitly
> state whether caching is initially enabled or disabled.

Done.

>> +    def disable_cache(self):
>> +        """Disable cache."""
>> +        self._parents_map = None
>> +        if self._debug:
>> +            self._requested_parents = None
> 
> This method also clears the cache.  I think that's reasonable behaviour, but
> you should mention it in the docstring.

Done.

> 
>>      def get_parent_map(self, keys):
>> -        """See _StackedParentsProvider.get_parent_map"""
> [...]
>> +        """See RemoteRepository.get_parent_map."""
> 
> It's not a big deal, but I think this docstring should refer to something
> other than RemoteRepository, probably _StackedParentsProvider.

Done.

> If only we
> had a ParentProvider abstract base class/interface class, so there was a
> clearly canonical place to document get_parent_map.  Oh well.

Duck typing, man.  It's the way of the future.

>> +        # Hack to build up the caching logic.
>> +        ancestry = self._parents_map
>> +        if ancestry is None:
>> +            # Caching is disabled.
>> +            missing_revisions = set(keys)
>> +            ancestry = {}
>> +        else:
>> +            missing_revisions = set(key for key in keys if key not in ancestry)
>> +        if missing_revisions:
>> +            parent_map = self._get_parent_map(missing_revisions)
>> +            if self._debug:
>> +                mutter('re-retrieved revisions: %d of %d',
>> +                        len(set(ancestry).intersection(parent_map)),
>> +                        len(parent_map))
>> +            ancestry.update(parent_map)
>> +            if self._cache_misses:
>> +                ancestry.update(dict((k, None) for k in missing_revisions
>> +                                     if k not in parent_map))
> 
> Add a comment here to say that None is never a valid parents list, so None
> is a safe value to use to record a miss.

Done.

>> === modified file 'bzrlib/remote.py'
>> --- bzrlib/remote.py	2008-12-01 23:50:52 +0000
>> +++ bzrlib/remote.py	2008-12-03 05:09:44 +0000
>> @@ -287,17 +287,6 @@
> [...]
>> @@ -481,7 +470,8 @@
>>              other_repository.bzrdir.transport.base !=
>>              self.bzrdir.transport.base):
>>              parents_provider = graph._StackedParentsProvider(
>> -                [parents_provider, other_repository._make_parents_provider()])
>> +                [self._unstacked_provider,
>> +                 other_repository._make_parents_provider()])
>>          return graph.Graph(parents_provider)
> 
> Is this change correct?

No, it appears not.

> === modified file 'bzrlib/tests/per_repository/test_add_fallback_repository.py'
> --- bzrlib/tests/per_repository/test_add_fallback_repository.py	2008-09-04 20:32:04 +0000
> +++ bzrlib/tests/per_repository/test_add_fallback_repository.py	2008-12-10 02:31:19 +0000
> @@ -16,7 +16,11 @@
>  
>  """Tests for Repository.add_fallback_repository."""
>  
> -from bzrlib import errors
> +from bzrlib import (
> +    bzrdir,
> +    errors,
> +    remote,
> +    )
>  from bzrlib.revision import NULL_REVISION
>  from bzrlib.tests import TestNotApplicable
>  from bzrlib.tests.per_repository import TestCaseWithRepository
> @@ -25,7 +29,14 @@
>  class TestAddFallbackRepository(TestCaseWithRepository):
>  
>      def test_add_fallback_repository(self):
> -        repo = self.make_repository('repo')
> +        if isinstance(self.repository_format, remote.RemoteRepositoryFormat):
> +            # RemoteRepository by default builds a default format real
> +            # repository, but the default format is unstackble.  So explicitly
> +            # make a stackable real repository and use that.
> +            repo = self.make_repository('repo', format='1.9')
> +            repo = bzrdir.BzrDir.open(self.get_url('repo')).open_repository()
> +        else:
> +            repo = self.make_repository('repo')
>          tree = self.make_branch_and_tree('branch')
>          if not repo._format.supports_external_lookups:
>              self.assertRaises(errors.UnstackableRepositoryFormat,
> @@ -45,3 +56,10 @@
>          # ... or on the repository directly...
>          self.assertEqual({revision_id: (NULL_REVISION,)},
>              repo.get_parent_map([revision_id]))
> +        # ... or on the repository's graph.
> +        self.assertEqual({revision_id: (NULL_REVISION,)},
> +            repo.get_graph().get_parent_map([revision_id]))
> +        # ... or on the repository's graph, when there is an other repository.
> +        other = self.make_repository('other')
> +        self.assertEqual({revision_id: (NULL_REVISION,)},
> +            repo.get_graph(other).get_parent_map([revision_id]))
> 
> (As you can see, this patch also fixes this test so that it actually applies
> to RemoteRepository!  And ideally I think there should be multiple test methods
> rather than piggy-backing lots of tests on the end of one method, but I
> don't think that's worth stalling this branch for.)
> 
> AFAICT, RemoteRepository.get_graph should just be identical to
> Repository.get_graph.

I disagree.  It doesn't make much sense to put a StackedParentsProvider
inside a StackedParentsProvider.  We should just create a
StackedParentsProvider that includes:
- self._unstacked_parents_provider
- all fallback providers
- other_repository's parents provider

RemoteRepository.get_graph should assume that access to self is slower
than access to other_repository, and so it should put other_repository
first in the stacking order.

> 
>> === modified file 'bzrlib/tests/test_graph.py'
>> --- bzrlib/tests/test_graph.py	2008-10-30 15:20:00 +0000
>> +++ bzrlib/tests/test_graph.py	2008-12-03 05:09:44 +0000
>> @@ -1379,13 +1379,13 @@
>>  
>>      def test_get_parent_map(self):
>>          """Requesting the same revision should be returned from cache"""
>> -        self.assertEqual({}, self.caching_pp._cache)
>> +        self.assertEqual({}, self.caching_pp.get_cached_map())
>>          self.assertEqual({'a':('b',)}, self.caching_pp.get_parent_map(['a']))
>>          self.assertEqual(['a'], self.inst_pp.calls)
>>          self.assertEqual({'a':('b',)}, self.caching_pp.get_parent_map(['a']))
>>          # No new call, as it should have been returned from the cache
>>          self.assertEqual(['a'], self.inst_pp.calls)
>> -        self.assertEqual({'a':('b',)}, self.caching_pp._cache)
>> +        self.assertEqual({'a':('b',)}, self.caching_pp.get_cached_map())
>>  
>>      def test_get_parent_map_not_present(self):
>>          """The cache should also track when a revision doesn't exist"""
>> @@ -1394,7 +1394,7 @@
>>          self.assertEqual({}, self.caching_pp.get_parent_map(['b']))
>>          # No new calls
>>          self.assertEqual(['b'], self.inst_pp.calls)
>> -        self.assertEqual({'b':None}, self.caching_pp._cache)
>> +        self.assertEqual({'b':None}, self.caching_pp._parents_map)
> 
> Why aren't these tests consistent about whether they use get_cached_map() or
> _parents_map?

get_cached_map does not include cache misses.

> I think they should be consistent, probably in favour of
> get_cached_map().

Can't do it in favour of get_cached_map, so _parents_map (now _cache) it is.

>> +class TestCachingParentsProviderExtras(tests.TestCaseWithTransport):
> 
> “Extras”  What are “Extras” here?  A docstring is called for.

Done.

> This appears to be a couple of unit tests in a single method.  The first
> line is an assertion, and is repeated as the first line of another test.
> Make a new test method for it and delete it from test_cached and
> test_cache_misses.
> 
>     def test_cache_initially_empty(self):
>         self.assertEqual({}, self.caching_pp.get_cached_map())

Done.

> Similarly, there's a test about the behaviour of disable_cache() clearing
> the cache tacked on the end.  Make an explicit test for that, too.

Done.

Aaron