[MERGE] Fix non-rich to rich-root fetch

Tue Apr 29 08:15:56 BST 2008

Aaron Bentley wrote:
> Hi all,
> 
> This patch solves all the bugs I know about when fetching from any
> non-rich-root repository into any rich-root repository.  I have used an
> earlier version to successfully convert the Bazaar source tree, which
> has ghosts and unique root ids, into rich-root-pack format.

To the extent of my knowledge in this area, this looks ok technically. I
did find it much harder to understand exactly what was going on though
than I anticipated, so I'm requesting some refactoring before this gets
merged. I appreciate that this is the first of a few patches in this
area, so let me know if you think the refactoring requests below are
premature. My thinking is this: if we encounter any bugs upgrading most
bzr repositories in the world to rick-root-pack, then the code needs to
be as clear as it can be, recognising that there is a base level of
complexity here given the numerous corner cases that can arise.

bb:resubmit

>          inventory_weave = self.source.get_inventory_weave()
> -        parent_texts = {}
>          versionedfile = {}

inventory_weave isn't used anywhere in this routine so can go?
versionedfile (as a map) isn't either now.

>          to_store = self.target.weave_store
> -        parent_map = self.source.get_graph().get_parent_map(revs)
> +        graph = self.source.get_graph()
> +        parent_map = graph.get_parent_map(revs)

OK. to_store is set well before it's being used though.

> +        planned_versions = {}
> +        revision_root = {}
>          for tree in self.iter_rev_trees(revs):
>              revision_id = tree.inventory.root.revision
>              root_id = tree.get_root_id()
> -            parents = parent_map[revision_id]
> -            if parents[0] == NULL_REVISION:
> -                parents = ()
> -            if root_id not in versionedfile:
> -                versionedfile[root_id] = to_store.get_weave_or_empty(root_id,
> -                    self.target.get_transaction())
> -            _, _, parent_texts[root_id] = versionedfile[root_id].add_lines(
> -                revision_id, parents, [], parent_texts)
> +            planned_versions.setdefault(root_id, []).append(revision_id)
> +            revision_root[revision_id] = root_id
> +        # Find out which parents we don't already know root ids for
> +        parents = set()
> +        for revision_parents in parent_map.itervalues():
> +            parents.update(revision_parents)
> +        parents.difference_update(revision_root.keys() + [NULL_REVISION])
> +        # Limit to revision present in the versionedfile
> +        parents = graph.get_parent_map(parents).keys()
> +        for tree in self.iter_rev_trees(parents):
> +            root_id = tree.get_root_id()
> +            revision_root[tree.get_revision_id()] = root_id

I think this block of code needs to be a method in its own right,
returning planned_revisions and revision_root. Given your overall
approach of "make it work, make it right, make it fast" and recognising
this is step 1, I didn't spent much time thinking though the performance
implications of the new code. The multiple calls to iter_rev_trees()
stands out though. In any case, we can optimise it later and making the
code a separate method ought to help both in the profiling itself and in
reviewing the faster code if it is indeed necessary to tune it.

> +        for root_id, versions in planned_versions.iteritems():
> +            versionedfile = to_store.get_weave_or_empty(root_id,
> +                self.target.get_transaction())
> +            parent_texts = {}
> +            for revision_id in versions:
> +                if revision_id in versionedfile:
> +                    continue
> +                parents = parent_map[revision_id]
> +                # When a parent revision is a ghost, we guess that its root id
> +                # was unchanged.
> +                parents = tuple(p for p in parents if p != NULL_REVISION
> +                    and revision_root.get(p, root_id) == root_id)
> +                result = versionedfile.add_lines_with_ghosts(
> +                    revision_id, parents, [], parent_texts)
> +                parent_texts[revision_id] = result[2]
>  

This looks ok. I'd appreciate it though if the comment also explained
that parents get dropped if their revision_root doesn't match, and why
you're doing that.

> +class Test1To2Fetch(TestCaseWithTransport):
> +    """Tests for Model1To2 failure modes"""
> +
> +    def do_test(self, first, second):
> +        """Test that fetch works no matter what the set order of revision is.
> +
> +        This test depends on the order of items in a set, which is
> +        implementation-dependant, so we test A, B and then B, A.
> +        """
> +        self.make_tree_and_repo()
> +        self.tree.commit('Commit 1', rev_id=first)
> +        self.tree.commit('Commit 2', rev_id=second)
> +        self.repo.fetch(self.tree.branch.repository, second)
> +
> +    def get_parents(self, file_id, revision_id):
> +        transaction = self.repo.get_transaction()
> +        vf = self.repo.weave_store.get_weave(file_id, transaction)
> +        return vf.get_parents_with_ghosts(revision_id)
> +
> +    def test_fetch_order_AB(self):
> +        """See do_test"""
> +        self.do_test('A', 'B')
> +
> +    def test_fetch_order_BA(self):
> +        """See do_test"""
> +        self.do_test('B', 'A')
> +
> +    def test_fetch_ghosts(self):
> +        self.make_tree_and_repo()
> +        self.tree.commit('first commit', rev_id='left-parent')
> +        self.tree.add_parent_tree_id('ghost-parent')
> +        fork = self.tree.bzrdir.sprout('fork', 'null:').open_workingtree()
> +        fork.commit('not a ghost', rev_id='not-ghost-parent')
> +        self.tree.branch.repository.fetch(fork.branch.repository,
> +                                     'not-ghost-parent')
> +        self.tree.add_parent_tree_id('not-ghost-parent')
> +        self.tree.commit('second commit', rev_id='second-id')
> +        self.repo.fetch(self.tree.branch.repository, 'second-id')
> +        root_id = self.tree.get_root_id()
> +        self.assertEqual(['left-parent', 'ghost-parent', 'not-ghost-parent'],
> +                         self.get_parents(root_id, 'second-id'))
> +
> +    def make_tree_and_repo(self):
> +        self.tree = self.make_branch_and_tree('tree', format='pack-0.92')
> +        self.repo = self.make_repository('rich-repo', format='rich-root-pack')
> +        self.repo.lock_write()
> +        self.addCleanup(self.repo.unlock)
> +

My initial reaction on seeing this class was that it ought to be two:
one covering the fetch_order stuff and another covering the ghost and
changed root issues. The make_tree_and_repo() helper method does tie
them together though, and abstracting that into an abstract class seems
more effort/complexity than it's worth. To "partition" things slightly
better though, I'd like to see:

* make_tree_and_repo() made the first method in the class
* do_test and the two routines using it put right next to each other
  (without get_parents in the middle)

do_test() is probably just a little *too* generic a name as well given
the numerous other tests covered by the class. do_fetch_order_test() or
do_order_test() would be better I feel.

Ian C.