[MERGE] Make merge more than 2x faster

Aaron Bentley aaron.bentley at utoronto.ca
Fri Jul 13 01:03:25 BST 2007


John Arbash Meinel wrote:
> +    def _add_parent(self):
> 
> This seems like it should be called '_add_parents()' since it adds any new
> parents that it finds.

Actually, it only adds one parent:
+    def _add_parent(self):
+        new_parents = self.this_tree.get_parent_ids() + [self.other_rev_id]


I think that _add_parents would be a misleading name.  Can I leave it
the way it is?
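
For the record, here's the shape of the whole method (a sketch: only
the first line of the body is quoted from the patch; the rest is my
paraphrase of what it boils down to):

    def _add_parent(self):
        # One new parent commit: OTHER's tip becomes an additional
        # parent of the pending merge in THIS tree.
        new_parents = self.this_tree.get_parent_ids() + [self.other_rev_id]
        new_parent_trees = [(revision_id, self.revision_tree(revision_id))
                            for revision_id in new_parents]
        self.this_tree.set_parent_trees(new_parent_trees)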

> As for performance improvements, one issue with DirState as it stands, is that
> it requires 2 passes to update the working inventory and the parent trees
> (rather than a single: here is all your new information).

Ooh, trying to combine those steps could get really ugly.

TreeTransform produces an inventory delta, and it would certainly be
possible to defer applying it until we were ready to set parents.

But setting parents from full trees is not efficient anyhow, because it
has to walk the whole tree.  Once we have efficient historical tree
comparison, we will be able to generate inventory deltas cheaply, and
then update the parent trees from those deltas instead.
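
The idea, as a toy sketch (plain dicts standing in for a real
Inventory; this is not bzrlib API):

    def apply_inventory_delta(entries_by_file_id, delta):
        # delta items: (old_path, new_path, file_id, new_entry);
        # new_path of None means the file was deleted.
        for old_path, new_path, file_id, new_entry in delta:
            if new_path is None:
                del entries_by_file_id[file_id]
            else:
                entries_by_file_id[file_id] = (new_path, new_entry)
        return entries_by_file_id

Touching only the entries named in the delta makes the cost
proportional to the size of the change, not the size of the tree.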

> I'm guessing text
> extraction is much more of a problem, but it might be something to think about.
> (You've done more profiling in this area than I have).

LSProf measures _add_parent as 3.95 ticks out of 104; get_file is 13.74.

Part of the problem is that WorkingTree4.add_parent_tree is completely
suboptimal: it winds up calling set_parent_ids!

I note this comment in dirstate.py:
# We can probably save cycles by reusing parent
# data and doing an incremental update when adding an additional
# parent, but for now the common cases are adding a new parent (merge),
# and replacing completely (commit), and commit is more common: so
# optimise merge later.

Since I don't have the skillz to fix dirstate.py, I'm fixing it in merge.
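
(The shape of the incremental update seems clear enough, though -- as
a toy sketch, nothing like the real dirstate row format:

    def add_parent_column(rows, new_parent_details, absent):
        # rows: {file_id: [working_details, parent1_details, ...]}.
        # Appending one column per row reuses the existing parent
        # data instead of rebuilding every row from all the trees.
        for file_id, row in rows.items():
            row.append(new_parent_details.get(file_id, absent))

but actually wiring that into dirstate.py is the part I'm not
volunteering for.)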

> Have you tried it with my knit_index_pyrex patch?

No, I'm waiting for the movie to come out.  Seriously, why isn't it
merged yet?

> It should help merge a lot if
> it is reading data from different file knits. (Like merging 100 revisions of
> bzr.dev into an old branch).

I'm spending 20.48 ticks in _load_data.  Squashing that would be fantastic.

> +        try:
> +            pb = ui.ui_factory.nested_progress_bar()
> +            try:
> +                this_repo = self.this_branch.repository
> +                graph = this_repo.get_graph()
> +                revisions = [ensure_null(self.this_basis),
> +                             ensure_null(self.other_basis)]
> +                if NULL_REVISION in revisions:
> +                    self.base_rev_id = NULL_REVISION
> +                else:
> +                    self.base_rev_id = graph.find_unique_lca(*revisions)
> +                    if self.base_rev_id == NULL_REVISION:
> +                        raise UnrelatedBranches()
> +            finally:
> +                pb.finished()
> +        except NoCommonAncestor:
> +            raise UnrelatedBranches()
> 
> ^- It seems like your try/finally for the pb should be outside your try/except
> NoCommonAncestor, though the except NoCommonAncestor seems completely
> unnecessary at this point, since it looks like find_unique_lca returns
> NULL_REVISION rather than raising NoCommonAncestor.

Good points.

In fact, the progress bar is also unused, so I got rid of that, too.


> +    def _maybe_fetch(self, source, target, revision_id):
> +        if (source.repository.bzrdir.root_transport.base !=
> +            target.repository.bzrdir.root_transport.base):
> +            target.fetch(source, revision_id)
> +
> 
> ^- It is a shame that you have to do this sort of work. It would probably be
> better to put this sort of effort into Repository.fetch() itself.

RepoFetcher.__init__ is already that smart:

        if (to_repository.control_files._transport.base ==
                from_repository.control_files._transport.base):
            # check that last_revision is in 'from' and then return a
            # no-operation.
            if last_revision not in (None, NULL_REVISION):
                from_repository.get_revision(last_revision)
            return

The problem is that fetch is expected to raise NoSuchRevision if the
revision is not present in the target repo.  That's why it does
from_repository.get_revision, and then throws away the result.

My code exists to avoid that get_revision call.
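
A presence check that doesn't parse the whole revision text would keep
fetch's contract without the waste -- something like this (a sketch
using Repository.has_revision):

        # check that last_revision is in 'from' without
        # deserializing it the way get_revision does
        if last_revision not in (None, NULL_REVISION):
            if not from_repository.has_revision(last_revision):
                raise NoSuchRevision(from_repository, last_revision)
        return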

> === modified file bzrlib/workingtree_4.py // last-changed:abentley at panoramicfeedback.com-20070712170702-kefnhuo926w6cvhz
> --- bzrlib/workingtree_4.py
> +++ bzrlib/workingtree_4.py
> @@ -1488,6 +1488,10 @@
>              return parent_details[1]
>          return None
> 
> +    def get_weave(self, file_id):
> +        return self._repository.weave_store.get_weave(file_id,
> +                self._repository.get_transaction())
> +
> 
> ^- Why is this necessary?

--merge-type weave needs access to versionedfiles, at least with the
current API.  DirStateRevisionTrees couldn't previously be selected for
use with --merge-type weave, but Merger.revision_tree will now select them.
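
For illustration, the weave merge path boils down to something like
this (simplified; plan_merge and weave_merge per the VersionedFile
API):

    # The merger works on the per-file versionedfile, not on tree
    # contents, hence the new accessor on the tree.
    vf = this_tree.get_weave(file_id)
    plan = vf.plan_merge(this_revision_id, other_revision_id)
    merged_lines = vf.weave_merge(plan)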

Here's an updated patch, since I diverged a bit from what you asked for.

Aaron
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fast-merge-2.patch
Type: text/x-patch
Size: 108448 bytes
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20070712/8d0101e3/attachment-0001.bin

