[MERGE] Reconcile can fix bad parent references, Andrew's version

Tue Oct 2 05:12:40 BST 2007

On Sat, 2007-09-29 at 00:38 +1000, Andrew Bennetts wrote:
> === added file
> 'bzrlib/tests/repository_implementations/test_broken.py'
> ...
> +"""Tests that use BrokenRepoScenario objects.
> +
> +That is, tests for reconcile and check.
> +"""

We try to group tests by the functionality they test. This appears to be
grouping by how the tests were written, which will make them hard to
find. I suggest 
test_check_reconcile.py
(if you need tests of both functions to be conflated)

> +
> +from bzrlib.inventory import Inventory, InventoryFile
> +from bzrlib.revision import Revision
> +from bzrlib.tests import TestNotApplicable
> +from bzrlib.tests.repository_implementations import
> TestCaseWithRepository
> +
> +import sha

PEP-8, standard lib before local imports.

> +class TestReconcile(TestCaseWithRepository):
> +    """Tests for how reconcile corrects errors in parents of file
> versions."""

PEP-8, one line if possible, or one and a paragraph... and the class
name does not imply the narrow focus the docstring does. We know this is
a test class so perhaps
class TestFileParentReconciliation
    """The tests of incorrect file-version parents reconciliation."""

> +    def assertCheckScenario(self, scenario):
> +        repo =
> self.make_repository_using_factory(scenario.populate_repository)
> +        self.require_text_parent_corruption(repo)
> +        check_result = repo.check()
> +        check_result.report_results(verbose=True)
> +        for pattern in scenario.check_regexes():
> +            self.assertContainsRe(
> +                self._get_log(keep_log_file=True),
> +                pattern)

This method needs a docstring and a better name. either assert or check
is redundant. And scenario is incredibly generic here, if you want to go
hungarian (naming after what a thing is, not what its for), then perhaps
reconcile_scenario).

> +    def make_repository_using_factory(self, factory):
> +        """Create a new repository populated by the given factory."""
> +        repo = self.make_repository('broken-repo')
> +        repo.lock_write()
> +        try:
> +            repo.start_write_group()
> +            try:
> +                factory(repo)
> +                repo.commit_write_group()
> +                return repo
> +            except:
> +                repo.abort_write_group()
> +                raise
> +        finally:
> +            repo.unlock()

Perhaps 'make_populated_repository' ?

> +    def add_revision(self, repo, revision_id, inv, parent_ids):
> +        inv.revision_id = revision_id
> +        inv.root.revision = revision_id
> +        repo.add_inventory(revision_id, inv, parent_ids)
> +        revision = Revision(revision_id,
> committer='jrandom at example.com',
> +            timestamp=0, inventory_sha1='', timezone=0,
> message='foo',
> +            parent_ids=parent_ids)
> +        repo.add_revision(revision_id,revision, inv)

This could do with a docstring.

> +    def require_text_parent_corruption(self, repo):
> +        if not repo._reconcile_fixes_text_parents:
> +            raise TestNotApplicable(
> +                    "Format does not support text parent reconciliation")

perhaps tweak the name, 'require_repo_suffers_text_parent_corruption'.

> +    def file_parents(self, repo, revision_id):
> +        return repo.weave_store.get_weave('a-file-id',
> +            repo.get_transaction()).get_parents(revision_id)

This looks like something worth putting onto Repository.

> +    def assertReconcileResults(self, scenario):
> +        """Construct a repository and reconcile it, verifying the state before
> +        and after.
> +
> +        :param scenario: a Scenario to test reconcile on.
> +        """

going from the method name this is more than an assertion, it takes
active action.
I suggest 
checkReconcileBehaviour

> +        # The content of the versionedfile should be the same after
> the
> +        # reconcile.

'The content of the versions in the versionedfile ...'

> +    def test_reconcile(self):
> +        self.assertReconcileResults(self.scenario_class(self))
> +
> +    def test_check(self):
> +        self.assertCheckScenario(self.scenario_class(self))

These methods are very opaque.

> === added file 'bzrlib/tests/repository_implementations/test_check.py'
> --- bzrlib/tests/repository_implementations/test_check.py       1970-01-01 00:00:00 +0000
> +++ bzrlib/tests/repository_implementations/test_check.py       2007-09-28 12:56:56 +0000
> @@ -0,0 +1,135 @@
> +# Copyright (C) 2007 Canonical Ltd
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write to the Free Software
> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> +
> +
> +"""Test operations that check the repository for corruption"""
> +
> +
> +from bzrlib import (
> +    inventory,
> +    revision as _mod_revision,
> +    )
> +from bzrlib.repository import _RevisionTextVersionCache
> +from bzrlib.tests.repository_implementations import TestCaseWithRepository
> +
> +
> +class TestFindBadAncestors(TestCaseWithRepository):
> +
> +    def make_broken_repository(self):
> +        repo = self.make_repository('.')
> +        cleanups = []
> +        try:
> +            repo.lock_write()
> +            cleanups.append(repo.unlock)
> +            repo.start_write_group()
> +            cleanups.append(repo.commit_write_group)
> +            # make rev1a: A well-formed revision, containing 'file1'
> +            inv = inventory.Inventory(revision_id='rev1a')
> +            inv.root.revision = 'rev1a'
> +            self.add_file(repo, inv, 'file1', 'rev1a', [])
> +            repo.add_inventory('rev1a', inv, [])
> +            revision = _mod_revision.Revision('rev1a',
> +                committer='jrandom at example.com', timestamp=0,
> +                inventory_sha1='', timezone=0, message='foo', parent_ids=[])
> +            repo.add_revision('rev1a',revision, inv)
> +
> +            # make rev1b, which has no Revision, but has an Inventory, and
> +            # file1
> +            inv = inventory.Inventory(revision_id='rev1b')
> +            inv.root.revision = 'rev1b'
> +            self.add_file(repo, inv, 'file1', 'rev1b', [])
> +            repo.add_inventory('rev1b', inv, [])
> +
> +            # make rev2, with file1 and file2
> +            # file2 is sane
> +            # file1 has 'rev1b' as an ancestor, even though this is not
> +            # mentioned by 'rev1a', making it an unreferenced ancestor
> +            inv = inventory.Inventory()
> +            self.add_file(repo, inv, 'file1', 'rev2', ['rev1a', 'rev1b'])
> +            self.add_file(repo, inv, 'file2', 'rev2', [])
> +            self.add_revision(repo, 'rev2', inv, ['rev1a'])
> +
> +            # make ghost revision rev1c
> +            inv = inventory.Inventory()
> +            self.add_file(repo, inv, 'file2', 'rev1c', [])
> +
> +            # make rev3 with file2
> +            # file2 refers to 'rev1c', which is a ghost in this repository, so
> +            # file2 cannot have rev1c as its ancestor.
> +            inv = inventory.Inventory()
> +            self.add_file(repo, inv, 'file2', 'rev3', ['rev1c'])
> +            self.add_revision(repo, 'rev3', inv, ['rev1c'])
> +            return repo
> +        finally:
> +            for cleanup in reversed(cleanups):
> +                cleanup()
> +
> +    def add_revision(self, repo, revision_id, inv, parent_ids):
> +        inv.revision_id = revision_id
> +        inv.root.revision = revision_id
> +        repo.add_inventory(revision_id, inv, parent_ids)
> +        revision = _mod_revision.Revision(revision_id,
> +            committer='jrandom at example.com', timestamp=0, inventory_sha1='',
> +            timezone=0, message='foo', parent_ids=parent_ids)
> +        repo.add_revision(revision_id,revision, inv)
> +
> +    def add_file(self, repo, inv, filename, revision, parents):
> +        file_id = filename + '-id'
> +        entry = inventory.InventoryFile(file_id, filename, 'TREE_ROOT')
> +        entry.revision = revision
> +        entry.text_size = 0
> +        inv.add(entry)
> +        vf = repo.weave_store.get_weave_or_empty(file_id,
> +                                                 repo.get_transaction())
> +        vf.add_lines(revision, parents, ['line\n'])

The above logic looks very similar to that int he broken_repo tests, is
it duplicated?

> +    def find_bad_ancestors(self, file_id, revision_ids):
> +        repo = self.make_broken_repository()
> +        vf = repo.weave_store.get_weave(file_id, repo.get_transaction())
> +        return repo.find_bad_ancestors(revision_ids, file_id, vf,
> +                                       _RevisionTextVersionCache(repo))

Going from your patch description this method and tests would seem to be
redundant now, perhaps delete it ?

> === modified file 'NEWS'
> --- NEWS        2007-09-24 09:40:01 +0000
> +++ NEWS        2007-09-26 10:49:46 +0000
> @@ -25,7 +25,10 @@
>  
>    FEATURES:
>  
> -   * New ``reconfigure`` command (Aaron Bentley)
> +   * New ``reconfigure`` command. (Aaron Bentley)
> +
> +   * New ``revert --forget-merges`` command, which removes the record of a pending 
> +     merge without affecting the working tree contents.  (Martin Pool)

Why are these being changed by this patch?

> ... 
> +   * Fix 'unprintable error' message when displaying BzrCheckError and 
> +     some other exceptions on Python 2.5.
> +     (Martin Pool, #144633)

And this?

> @@ -116,6 +123,10 @@
>  
>    INTERNALS:
>  
> +    * ``bzrlib.transport.Transport.put_file`` now returns the number of bytes
> +      put by the method call, to allow avoiding stat-after-write or
> +      housekeeping in callers. (Robert Collins)

And this?

>     * New method ``bzrlib.osutils.minimum_path_selection`` useful for removing
>       duplication from user input, when a user mentions both a path and an item
>       contained within that path. (Robert Collins)
> @@ -333,6 +344,12 @@
>       include them as Concepts within the User Reference.
>       (Paul Moore, Ian Clatworthy)
>  
> +    * ``check`` can detect versionedfile parent references that are
> +      inconsistent with revision and inventory info, and ``reconcile`` can fix
> +      them.  These faulty references were generated by 0.8-era releases,
> +      so repositories which were manipulated by old bzrs should be
> +      checked, and possibly reconciled ASAP.  (Aaron Bentley)
...

> @@ -3048,19 +3063,24 @@
>  
>      _see_also = ['cat', 'export']
>      takes_options = [
> -            'revision',
> -            Option('no-backup', "Do not save backups of reverted files."),
> -            ]
> +        'revision',
> +        Option('no-backup', "Do not save backups of reverted files."),
> +        Option('forget-merges',
> +               'Remove pending merge marker, without changing any files.'),
> +        ]
>      takes_args = ['file*']
>  
> -    def run(self, revision=None, no_backup=False, file_list=None):
> -        if file_list is not None:
> -            if len(file_list) == 0:
> -                raise errors.BzrCommandError("No files specified")
> -        
> +    def run(self, revision=None, no_backup=False, file_list=None,
> +            forget_merges=None):
>          tree, file_list = tree_files(file_list)
> +        if forget_merges:
> +            tree.set_parent_ids(tree.get_parent_ids()[:1])
> +        else:
> +            self._revert_tree_to_revision(tree, revision, file_list, no_backup)
> +
> +    @staticmethod
> +    def _revert_tree_to_revision(tree, revision, file_list, no_backup):
>          if revision is None:
> -            # FIXME should be tree.last_revision
>              rev_id = tree.last_revision()
>          elif len(revision) != 1:
>              raise errors.BzrCommandError('bzr revert --revision takes exactly 1 argument')
> @@ -3068,7 +3088,7 @@
>              rev_id = revision[0].in_history(tree.branch).rev_id
>          pb = ui.ui_factory.nested_progress_bar()
>          try:
> -            tree.revert(file_list, 
> +            tree.revert(file_list,
>                          tree.branch.repository.revision_tree(rev_id),
>                          not no_backup, pb, report_changes=True)
>          finally:

Something is definately wrong in this bundle, there are changes here for
revert and other commands.

> === modified file 'bzrlib/check.py'
> --- bzrlib/check.py     2007-08-22 22:32:23 +0000
> +++ bzrlib/check.py     2007-09-28 08:18:16 +0000
> @@ -32,6 +32,7 @@
>  # raising them.  If there's more than one exception it'd be good to see them
>  # all.
>  
> +from bzrlib import repository as _mod_repository
>  from bzrlib.errors import BzrCheckError
>  import bzrlib.ui
>  from bzrlib.trace import note
> @@ -53,6 +54,10 @@
>          # maps (file-id, version) -> sha1; used by InventoryFile._check
>          self.checked_texts = {}
>          self.checked_weaves = {}
> +        self.revision_versions = _mod_repository._RevisionTextVersionCache(
> +            self.repository)
> +        self.unreferenced_ancestors = set()
> +        self.inconsistent_parents = []
>  
>      def check(self):
>          self.repository.lock_read()
> @@ -64,36 +69,39 @@
>              self.inventory_weave = self.repository.get_inventory_weave()
>              self.plan_revisions()
>              revno = 0
> -            self.check_weaves()
>              while revno < len(self.planned_revisions):
>                  rev_id = self.planned_revisions[revno]
>                  self.progress.update('checking revision', revno,
>                                       len(self.planned_revisions))
>                  revno += 1
>                  self.check_one_rev(rev_id)
> +            # check_weaves is done after the revision scan so that
> +            # revision_versions is pre-populated
> +            self.check_weaves()
>          finally:
>              self.progress.finished()
>              self.repository.unlock()
>  
>      def plan_revisions(self):
>          repository = self.repository
> -        self.planned_revisions = set(repository.all_revision_ids())
> +        self.planned_revisions = repository.all_revision_ids()
>          self.progress.clear()
>          inventoried = set(self.inventory_weave.versions())
> -        awol = self.planned_revisions - inventoried
> +        awol = set(self.planned_revisions) - inventoried
>          if len(awol) > 0:
>              raise BzrCheckError('Stored revisions missing from inventory'
>                  '{%s}' % ','.join([f for f in awol]))
> -        self.planned_revisions = list(self.planned_revisions)
>  
>      def report_results(self, verbose):
>          note('checked repository %s format %s',
>               self.repository.bzrdir.root_transport,
>               self.repository._format)
>          note('%6d revisions', self.checked_rev_cnt)
> +        note('%6d versionedfiles', len(self.checked_weaves))

I suggest this say '%6d file-ids'

> @@ -113,6 +121,19 @@
>                      note('      %s should be in the ancestry for:', link)
>                      for linker in linkers:
>                          note('       * %s', linker)
> +            if verbose:
> +                for file_id, revision_id in self.unreferenced_ancestors:
> +                    note('unreferenced ancestor: {%s} in %s', revision_id,
> +                        file_id)

This is a critical error, it should error, not output a warning.

> +        if len(self.inconsistent_parents):
> +            note('%6d inconsistent parents', len(self.inconsistent_parents))
> +            if verbose:
> +                for info in self.inconsistent_parents:
> +                    revision_id, file_id, found_parents, correct_parents = info
> +                    note('      * %s version %s has parents %r '
> +                         'but should have %r'
> +                         % (file_id, revision_id, found_parents,
> +                             correct_parents))

This however, is fine.

> ...

> +            result = w.find_bad_ancestors(
> +                self.planned_revisions,
> +                self.revision_versions.get_text_version,
> +                weave_id,
> +                revision_parents,
> +                self.repository.get_graph())
> +            for revision_id in result.iterkeys():
> +                self.unreferenced_ancestors.add((weave_id, revision_id))
>              self.checked_weaves[weave_id] = True

Why do you pass the weave id back into a method on the weave ? This
*really* looks like its barely related to the weave at all.

>      def _check_revision_tree(self, rev_id):
>          tree = self.repository.revision_tree(rev_id)
> +        self.revision_versions.add_revision_text_version(tree)

I'd pluralise this method.

>          inv = tree.inventory
>          seen_ids = {}
>          for file_id in inv:

But as it seems to be doing the same loop as this, for locality of
reference why not just accrue the id:versions in the loop and supply to
the cache afterwards.

> === modified file 'bzrlib/errors.py'
> --- bzrlib/errors.py    2007-09-13 01:54:49 +0000
> +++ bzrlib/errors.py    2007-09-25 06:29:46 +0000
> @@ -95,7 +95,11 @@
>          try:
>              fmt = self._get_format_string()
>              if fmt:
> -                s = fmt % self.__dict__
> +                d = (self.__dict__)
> +                # special case: python2.5 puts the 'message' attribute in a
> +                # slot, so it isn't seen in __dict__
> +                d['message'] = getattr(self, 'message', 'dummy message')
> +                s = fmt % d
>                  # __str__() should always return a 'str' object
>                  # never a 'unicode' object.
>                  if isinstance(s, unicode):

This looks entirely unrelated.

> @@ -783,8 +787,8 @@
>      # New code should prefer to raise specific subclasses
>      def __init__(self, message):
>          # Python 2.5 uses a slot for StandardError.message,
> -        # so use a different variable name
> -        # so it is exposed in self.__dict__
> +        # so use a different variable name.  We now work around this in
> +        # BzrError.__str__, but this member name is kept for compatability.
>          self.msg = message

And this comment here doesn't note that using '%(message)s %(msg)s' will
fail, so even if the code isn't new, its limited in an undocumented way.

> === modified file 'bzrlib/reconcile.py'
> --- bzrlib/reconcile.py 2007-09-12 04:21:51 +0000
> +++ bzrlib/reconcile.py 2007-09-28 04:40:46 +0000
> @@ -20,9 +20,14 @@
>  __all__ = ['reconcile', 'Reconciler', 'RepoReconciler', 'KnitReconciler']
>  
>  
> -from bzrlib import ui
> +from bzrlib import (
> +    errors,
> +    graph,
> +    ui,
> +    repository,
> +    )
>  from bzrlib.trace import mutter
> -from bzrlib.tsort import TopoSorter
> +from bzrlib.tsort import TopoSorter, topo_sort
>  
>  
>  def reconcile(dir, other=None):
> @@ -276,8 +281,8 @@
>          """Perform the steps to reconcile this repository."""
>          if self.thorough:
>              self._load_indexes()
> -            # knits never suffer this
>              self._gc_inventory()
> +            self._fix_text_parents()
>  
>      def _load_indexes(self):
>          """Load indexes for the reconciliation."""
> @@ -337,3 +342,45 @@
>          self.garbage_inventories = len(garbage)
>          for revision_id in garbage:
>              mutter('Garbage inventory {%s} found.', revision_id)
> +
> +    def _fix_text_parents(self):
> +        """Fix bad versionedfile parent entries.
> +
> +        It is possible for the parents entry in a versionedfile entry to be
> +        inconsistent with the values in the revision and inventory.
> +
> +        This method finds entries with such inconsistencies, corrects their
> +        parent lists, and replaces the versionedfile with a corrected version.
> +        """
> +        transaction = self.repo.get_transaction()
> +        revision_parents = repository._RevisionParentsProvider(self.repo)
> +        revision_graph = graph.Graph(revision_parents)

Why isn't this just
revision_graph = self.repo.get_graph() ?

> +            if len(versions_with_bad_parents) == 0:
> +                continue

I think from here to the end could be a separate method.

> +            new_vf = self.repo.weave_store.get_empty('temp:%s' % file_id,
> +                self.transaction)
> +            new_parents = {}
> +            for version in vf.versions():
> +                if version in versions_with_bad_parents:
> +                    parents = versions_with_bad_parents[version][1]
> +                else:
> +                    parents = vf.get_parents(version)
> +                new_parents[version] = parents
> +            for version in topo_sort(new_parents.items()):
> +                new_vf.add_lines(version, new_parents[version],
> +                                 vf.get_lines(version))
> +            self.repo.weave_store.copy(new_vf, file_id, self.transaction)
> +            self.repo.weave_store.delete('temp:%s' % file_id, self.transaction)

> === modified file 'bzrlib/repofmt/knitrepo.py'
> --- bzrlib/repofmt/knitrepo.py  2007-09-16 23:48:15 +0000
> +++ bzrlib/repofmt/knitrepo.py  2007-09-26 10:47:10 +0000
> @@ -69,6 +69,11 @@
>      """Knit format repository."""
>  
>      _serializer = xml5.serializer_v5
> +    def __init__(self, _format, a_bzrdir, control_files, _revision_store,
> +                 control_store, text_store):
> +        MetaDirRepository.__init__(self, _format, a_bzrdir, control_files,
> +                                   _revision_store, control_store, text_store)
> +        self._reconcile_fixes_text_parents = True

PEP-8 - newline between _serializer and the def __init__.

>  
> +    def find_bad_ancestors(self, revision_ids, file_id, versionedfile,
> +                           revision_versions):
> +        """Search the versionedfile for ancestors that are not referenced.
> +
> +        The graph of a given versionedfile should be a subset of the graph
> +        described by the repository's revisions.  One possible deviation is
> +        if a text's parents are not a subset of its revision's parents'
> +        last-modified revisions.  This deviation prevents
> +        fileids_altered_by_revision_ids from correctly determining which
> +        revisions of each text need to be fetched.
> +
> +        This method detects this case.
> +
> +        :param revision_ids: The revisions to scan for deviations
> +        :file_id: The file-id of the versionedfile to scan
> +        :versionedfile: The versionedfile to scan
> +        :revision_versions: A dict that is a cache of last-modified revisions
> +            of files for each version
> +        :parents_provider: An implementation of ParentsProvider to use for
> +            determining the revision graph's ancestry.
> +            _RevisionParentsProvider is recommended for this purpose.
> +        """
> +        graph = self.get_graph()
> +        return versionedfile.find_bad_ancestors(
> +            revision_ids,
> +            revision_versions.get_text_version,
> +            file_id,
> +            graph,
> +            graph)

This method seems rather pointless now.

> +class _RevisionTextVersionCache(object):
> +    """A cache of the versionedfile versions for revision and file-id"""
> +
> +    def __init__(self, repository):
> +        self.repository = repository
> +        self.revision_versions = {}
> +
> +    def add_revision_text_version(self, tree):
> +        """Cache text version data from the supplied revision tree"""
> +        inv_revisions = {}
> +        for path, entry in tree.iter_entries_by_dir():
> +            inv_revisions[entry.file_id] = entry.revision
> +        self.revision_versions[tree.get_revision_id()] = inv_revisions
> +        return inv_revisions
> +
> +    def get_text_version(self, file_id, revision_id):
> +        """Determine the text version for a given file-id and revision-id"""
> +        try:
> +            inv_revisions = self.revision_versions[revision_id]
> +        except KeyError:
> +            tree = self.repository.revision_tree(revision_id)
> +            inv_revisions = self.add_revision_text_version(tree)
> +        return inv_revisions.get(file_id)

This object does not look like a cache; it looks like a dict that will
autopopulate when you ask for something it does not know. So I suggest
not calling it a cache ;). 

> +class _RevisionParentsProvider(object):
> +    """A parents provider that uses a repositoy's revision objects.
> +
> +    Should only be used when checking for corruption.
> +
> +    For uncorrupt repositories, should give the same results as the repo's
> +    get_parents implementation, except much more slowly.
> +    """
> +    def __init__(self, repo):
> +        self._repo = repo
> +        self._memoized = {}
> +
> +    def get_parents(self, revision_ids):
> +        parents_list = []
> +        for revision_id in revision_ids:
> +            try:
> +                parents = self._memoized[revision_id]
> +            except KeyError:
> +                if revision_id == _mod_revision.NULL_REVISION:
> +                    parents = []
> +                else:
> +                    try:
> +                        revision = self._repo.get_revision(revision_id)
> +                        parents = revision.parent_ids
> +                    except errors.NoSuchRevision:
> +                        parents = None
> +                    else:
> +                        if len(parents) == 0:
> +                            parents = [_mod_revision.NULL_REVISION]
> +                self._memoized[revision_id] = parents
> +            parents_list.append(parents)
> +        return parents_list

Is this class now used or needed?

> === modified file 'bzrlib/tests/repository_implementations/__init__.py'
> --- bzrlib/tests/repository_implementations/__init__.py 2007-08-17 05:16:14 +0000
> +++ bzrlib/tests/repository_implementations/__init__.py 2007-09-28 14:07:01 +0000
> @@ -94,11 +94,410 @@
>                  relpath, format=format)
>  
>  

> +class BrokenRepositoryTestProviderAdapter(TestScenarioApplier):
> +
> +    scenario_classes = [
> +        FileParentIsNotInRevisionAncestryScenario,
> +        FileParentHasInaccessibleInventoryScenario,
> +        FileParentsNotReferencedByAnyInventoryScenario,
> +        TooManyParentsScenario,
> +        ClaimedFileParentDidNotModifyFileScenario,
> +        IncorrectlyOrderedParentsScenario,
> +        ]
> +
> +    def __init__(self):
> +        TestScenarioApplier.__init__(self)
> +        self.scenarios = [
> +            (s.__name__, {'scenario_class': s}) for s in
> self.scenario_classes]

I don't think a class is needed here. TestScenarioApplier is designed to
be usable without subclassing:
 applier = TestScenarioApplier()
 applier.scenarios = (s.__name__, {'scenario_class': s}) for s in
      [
        FileParentIsNotInRevisionAncestryScenario,
        FileParentHasInaccessibleInventoryScenario,
        FileParentsNotReferencedByAnyInventoryScenario,
        TooManyParentsScenario,
        ClaimedFileParentDidNotModifyFileScenario,
        IncorrectlyOrderedParentsScenario,
        ])

>  def test_suite():
>      result = TestSuite()
>      test_repository_implementations = [
>          'bzrlib.tests.repository_implementations.test_break_lock',
> +        # note that test_broken is intentionally excluded from this list; it is
> +        # handled further down.

You could make this more clear, and generic by creating all the appliers
you need, then making a list of tuples:
[(appliers, test_names), ...]

The stock tests are
((format_adapter,), [names])
The broken repo tests are
((format_adapter, broken_rep_applier), [test_broken_repo])

Then its fed straight into:
....
because theres a bunch of duplicate code at the moment.

> === modified file 'bzrlib/tests/repository_implementations/test_reconcile.py'
> --- bzrlib/tests/repository_implementations/test_reconcile.py   2007-08-22 05:28:32 +0000
> +++ bzrlib/tests/repository_implementations/test_reconcile.py   2007-09-28 14:09:52 +0000
> @@ -14,7 +14,7 @@
>  # along with this program; if not, write to the Free Software
>  # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
>  
> -"""Tests for reconiliation of repositories."""
> +"""Tests for reconciliation of repositories."""
>  
>  
>  import bzrlib
> @@ -23,10 +23,11 @@
>  from bzrlib.reconcile import reconcile, Reconciler
>  from bzrlib.revision import Revision
>  from bzrlib.tests import TestSkipped
> -from bzrlib.tests.repository_implementations.test_repository import TestCaseWithRepository
> +from bzrlib.tests.repository_implementations.test_repository import (
> +    TestCaseWithRepository,
> +    )
>  from bzrlib.transport import get_transport
>  from bzrlib.uncommit import uncommit
> -from bzrlib.workingtree import WorkingTree
>  
>  
>  class TestReconcile(TestCaseWithRepository):
> @@ -374,3 +375,29 @@
>          repo = d.open_repository()
>          self.checkUnreconciled(d, repo.reconcile())
>          self.checkUnreconciled(d, repo.reconcile(thorough=True))
> +
> +
> +
> +#commiting rev X, row 0 local change, row 1 no local change against at least one parent
> +#rev parents  X.revision inv.revision foreach parent     knit parents   label
> +#revA          X         revA                            revA            parent-changed
> +#revA          A         revA                            NO ENTRY
> +#revA          X         revB                            revB            parent-changed
> +#revA          A         revB                            NO ENTRY
> +#revA  revB    X         revC  revC                      revC            parent changed (C in ancestory of A,B)
> +#revA  revB    C         revC  revC                      NO ENTRY
> +#revA  revB    X         revA  revC                      revA            left-side change (C in ancestry of A)
> +#revA  revB    A         revA  revC                      NO ENTRY
> +#revA  revB    X         revC  revB                      revB            right-side change (C in ancestry of B)
> +#revA  revB    B         revC  revB                      NO ENTRY
> +#revA  revB    X         revA  revB                      revA revB       both-side change no prior join
> +#revA  revB    X         revA  revB                      revA revB       both-side change no prior join (that is, we always record a change in this case)
> +#revA  revB    X         revA  revB                      revA            both-side change, one side pre-merged (A merged B)
> +#revA  revB    A         revA  revB                      NO ENTRY       
> +#revA  revB    X         revC  revD                      revC revD       both-side parent change no prior join
> +#revA  revB    X         revC  revD                      revC revD       both-side parent change no prior join (that is we always record a change in this case)
> +#revA  revB    X         revC  revD                      revC            both sides parent change, C merged D
> +#revA  revB    C         revC  revD                      NO ENTRY

This matrix should go, or be documented for other readers.

> +        
> +    def test_check_error(self):
> +        # This has a member called 'message', which is problematic in
> +        # python2.5 because that is a slot on the base Exception class
> +        e = errors.BzrCheckError('example check failure')
> +        self.assertEqual(
> +            "Internal check failed: example check failure",
> +            str(e))
> +        self.assertTrue(e.internal_error)

I don't think this test needs the commentary: all our exceptions should
be tested that they render successfully.

> === modified file 'bzrlib/versionedfile.py'
> --- bzrlib/versionedfile.py     2007-09-20 06:12:51 +0000
> +++ bzrlib/versionedfile.py     2007-09-28 05:53:43 +0000
> @@ -37,6 +37,7 @@
>  
>  from bzrlib.inter import InterObject
>  from bzrlib.textmerge import TextMerge
> +from bzrlib.trace import mutter
>  
>  
>  class VersionedFile(object):
> @@ -495,6 +496,111 @@
>                      b_marker=TextMerge.B_MARKER):
>          return PlanWeaveMerge(plan, a_marker, b_marker).merge_lines()[0]
>  
> +    def calculate_parents(self, revision_id, get_text_version, file_id,
> +            parents_provider, repo_graph, get_inventory):
> +        text_revision = get_text_version(file_id, revision_id)
> +        if text_revision is None:
> +            return None
> +        parents_of_text_revision = parents_provider.get_parents(
> +            [text_revision])[0]
> +        parents_from_inventories = []
> +        for parent in parents_of_text_revision:
> +            if parent == revision.NULL_REVISION:
> +                continue
> +            try:
> +                inventory = get_inventory(parent)
> +            except errors.RevisionNotPresent:
> +                pass
> +            else:
> +                introduced_in = inventory[file_id].revision
> +                parents_from_inventories.append(introduced_in)
> +        mutter('%r:%r introduced in: %r',
> +               file_id, revision_id, parents_from_inventories)
> +        heads = set(repo_graph.heads(parents_from_inventories))
> +        mutter('    heads: %r', heads)
> +        new_parents = []
> +        for parent in parents_from_inventories:
> +            if parent in heads and parent not in new_parents:
> +                new_parents.append(parent)
> +        return new_parents
> +
> +    def check_parents(self, revision_ids, get_text_version, file_id,
> +            parents_provider, repo_graph, get_inventory):
> +        result = {}
> +        for num, revision_id in enumerate(revision_ids):
> +            correct_parents = self.calculate_parents(revision_id,
> +                    get_text_version, file_id, parents_provider, repo_graph,
> +                    get_inventory)
> +            if correct_parents is None:
> +                continue
> +            text_revision = get_text_version(file_id, revision_id)
> +            knit_parents = self.get_parents(text_revision)
> +            if correct_parents != knit_parents:
> +                result[revision_id] = (knit_parents, correct_parents)
> +        mutter('    RESULT: %r', result)
> +        return result
> +
> +    def find_bad_ancestors(self, revision_ids, get_text_version, file_id,
> +            parents_provider, repo_graph):
> +        """Search this versionedfile for ancestors that are not referenced.
> +
> +        One possible deviation is if a text's parents are not a subset of its
> +        revision's parents' last-modified revisions.  This deviation prevents
> +        fileids_altered_by_revision_ids from correctly determining which
> +        revisions of each text need to be fetched.
> +
> +        This method detects this case.
> +
> +        :param revision_ids: The revisions to scan for deviations
> +        :param file_id: The file-id of the versionedfile to scan
> +        :param get_text_version: a callable that takes two arguments,
> +            file_id and a revision_id, and returns the id of text version of
> +            that file in that revision.
> +        :param parents_provider: An implementation of ParentsProvider to use
> +            for determining the revision graph's ancestry.
> +            _RevisionParentsProvider is recommended for this purpose.
> +
> +        :returns: a dict mapping bad parents to a set of revisions they occur
> +            in.
> +        """
> +        result = {}
> +        from bzrlib.trace import mutter
> +        for num, revision_id in enumerate(revision_ids):
> +
> +            #if revision_id == 'broken-revision-1-2': import pdb; pdb.set_trace()
> +            #if revision_id == 'broken-revision-1-2':
> +            #    result.setdefault('parent-1',set()).add('broken-revision-1-2')
> +            #    result.setdefault('parent-2',set()).add('broken-revision-1-2')
> +            text_revision = get_text_version(file_id, revision_id)
> +            if text_revision is None:
> +                continue
> +
> +            file_parents = parents_provider.get_parents([text_revision])[0]
> +            revision_parents = set()
> +            for parent_id in file_parents:
> +                try:
> +                    revision_parents.add(get_text_version(file_id, parent_id))
> +                # Skip ghosts (this means they can't provide texts...)
> +                except errors.RevisionNotPresent:
> +                    continue
> +            # XXX:
> +            knit_parents = set(self.get_parents(text_revision))
> +            unreferenced = knit_parents.difference(revision_parents)
> +            for unreferenced_id in unreferenced:
> +                result.setdefault(unreferenced_id, set()).add(text_revision)
> +
> +            correct_parents = tuple(repo_graph.heads(knit_parents))
> +            spurious_parents = knit_parents.difference(correct_parents)
> +            for spurious_parent in spurious_parents:
> +                result.setdefault(spurious_parent, set()).add(text_revision)
> +            # XXX: false positives
> +            #text_parents = self.get_parents(text_revision)
> +            #if text_parents != file_parents:
> +            #    for text_parent in text_parents:
> +            #        result.setdefault(text_parent, set()).add(text_revision)
> +        mutter('find_bad_ancestors: %r', result)
> +        return result
> +

These methods are definately in the wrong place.

In summary there are three methods:
calculate_parents
check_parents
find_bad_ancestors

I don't know whats different between the last two - why are there three
methods and not two?

In terms of the right home for this, VersionedFile is the wrong place,
because it conflates delta storage with per-file graph caching. Knits
are not able to decouple these without a disk change, but packs do
decouple (at least on disk) this.

CommitBuilder currently is the right place to calculate what the
per-file graph parents for a text version are; I'm not sure what the
right place for versioned file is; but the fact that calculate_parents
does not use self at all, and check_parents only uses self to get the
parents from the knit, and to access calculate_parents is a pretty clear
smell to me.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071002/6040d005/attachment-0001.pgp