[MERGE] Make merge more than 2x faster

Fri Jul 13 03:14:09 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

+1 overall.

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> +    def _add_parent(self):
> 
>> This seems like it should be called '_add_parents()' since it adds any new
>> parents that it finds.
> 
> Actually, it only adds one parent:
> +    def _add_parent(self):
> +        new_parents = self.this_tree.get_parent_ids() + [self.other_rev_id]
> 
> 
> I think that add_parents would be a misleading name.  Can I leave it the
> way it is?

Yep. I just misread what you were doing. You said that 'add_parent_tree' was
suboptimal, and I thought that was because you were adding more than one.

> 
>> As for performance improvements, one issue with DirState as it stands, is that
>> it requires 2 passes to update the working inventory and the parent trees
>> (rather than a single: here is all your new information).
> 
> Ooh, trying to combine those steps could get really ugly.
> 
> TreeTransform produces an inventory delta, and it would certainly be
> possible to defer applying it until we were ready to set parents.
> 
> But setting parents to trees is not efficient anyhow, because it has to
> walk the whole tree.  Once we have efficient historical tree comparison,
> we will be able to generate inventory deltas cheaply.  So we will be
> able to update parent trees using inventory deltas.
> 
>> I'm guessing text
>> extraction is much more of a problem, but it might be something to think about.
>> (You've done more profiling in this area than I have).
> 
> LSProf measures _add_parent as 3.95 ticks out of 104.  get_file is 13.74
> 
> Part of the problem is that WorkingTree4's.add_parent_tree is completely
> suboptimal: it winds up doing set_parent_ids!
> 
> I note this comment in dirstate.py:
> # We can probably save cycles by reusing parent
> # data and doing an incremental update when adding an additional
> # parent, but for now the common cases are adding a new parent (merge),
> # and replacing completely (commit), and commit is more common: so
> # optimise merge later.
> 
> Since I don't have the skillz to fix dirstate.py, I'm fixing it in merge.
> 
>> Have you tried it with my knit_index_pyrex patch?
> 
> No, I'm waiting for the movie to come out.  Seriously, why isn't it
> merged yet?

Pyrex wasn't installed on the PQM, and while I was waiting for it, I switched
to getting my other pyrex code cleaned up and mergable. So that we can land
them both at approx the same time, to expose whatever problems we are going to
have together.

> 
>> It should help merge a lot if
>> it is reading data from different file knits. (Like merging 100 revisions of
>> bzr.dev into an old branch).
> 
> I'm spending 20.48 ticks in _load_data.  Squashing that would be fantastic.

Yeah, that is why I wrote it. I don't think it goes to zero, but it gets quite
a bit better. I was seeing a 2-3x improvement. So maybe 21 => 7?

> 
>> +        try:
>> +            pb = ui.ui_factory.nested_progress_bar()
>> +            try:
>> +                this_repo = self.this_branch.repository
>> +                graph = this_repo.get_graph()
>> +                revisions = [ensure_null(self.this_basis),
>> +                             ensure_null(self.other_basis)]
>> +                if NULL_REVISION in revisions:
>> +                    self.base_rev_id = NULL_REVISION
>> +                else:
>> +                    self.base_rev_id = graph.find_unique_lca(*revisions)
>> +                    if self.base_rev_id == NULL_REVISION:
>> +                        raise UnrelatedBranches()
>> +            finally:
>> +                pb.finished()
>> +        except NoCommonAncestor:
>> +            raise UnrelatedBranches()
> 
>> ^- It seems like your try/finally for the pb should be outside your try/except
>> NoCommonAncestor, though the except NoCommonAncestor seems completely
>> unnecessary at this point, since it looks like find_unique_lca returns
>> NULL_REVISION rather thain raising NoCommonAncestor.
> 
> Good points.
> 
> In fact, the progress bar is also unused, so I got rid of that, too.
> 
> 
>> +    def _maybe_fetch(self, source, target, revision_id):
>> +        if (source.repository.bzrdir.root_transport.base !=
>> +            target.repository.bzrdir.root_transport.base):
>> +            target.fetch(source, revision_id)
>> +
> 
>> ^- It is a shame that you have to do this sort of work. It would probably be
>> better to put this sort of effort into Repository.fetch() itself.
> 
> RepoFetcher.__init__ is already that smart:
> 
>         if to_repository.control_files._transport.base ==
> from_repository.control_files._transport.base:
>             # check that last_revision is in 'from' and then return a
> no-operation.
>             if last_revision not in (None, NULL_REVISION):
>                 from_repository.get_revision(last_revision)
>             return
> 
> The problem is that fetch is expected to raise NoSuchRevision if the
> revision is not present in the target repo.  That's why it does
> from_repository.get_revision, and then throws away the result.
> 
> My code is to avoid that get_revision.

I wonder if 'has_revision()' wouldn't be better. Since get_revision() has to
read revisions.kndx while everything else only reads inventory.kndx.

> 
>> === modified file bzrlib/workingtree_4.py // last-changed:abentley at panoramicfee
>> ... dback.com-20070712170702-kefnhuo926w6cvhz
>> --- bzrlib/workingtree_4.py
>> +++ bzrlib/workingtree_4.py
>> @@ -1488,6 +1488,10 @@
>>              return parent_details[1]
>>          return None
> 
>> +    def get_weave(self, file_id):
>> +        return self._repository.weave_store.get_weave(file_id,
>> +                self._repository.get_transaction())
>> +
> 
>> ^- Why is this necessary?
> 
> --merge-type weave needs access to versionedfiles, at least with the
> current API.  DirStateRevisionTrees couldn't previously be selected for
> use with --merge-type weave, but Merger.revision_tree will now select them.
> 
> Here's an updated patch, since I diverged a bit from what you asked for.
> 
> Aaron

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGlt/mJdeBCYSNAAMRArK1AJ9gSC85MNdDxdUNFPIacGbpVG17VQCfdpUt
dB3TZ/llc7L2m2SBGaZ3NIg=
=prj0
-----END PGP SIGNATURE-----