[MERGE] Faster diff on historical data

Aaron Bentley aaron.bentley at utoronto.ca
Wed Aug 8 18:47:33 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lukáš Lalinský wrote:

bb:resubmit

> People usually use such diffs only on nearby
> revisions, so the knit extraction involves a lot of duplicated work,
> e.g. instead of taking the file at revision 10 and applying one delta,
> it takes revision 1 and unpacks/applies 10 deltas.

It's worse than that, because those intermediate deltas could be used to
speed up the computation of the desired delta.  There's discussion of
this in my diff analysis: doc/developers/diff.txt

That document slightly predates your C sequence matcher, so I'm not
clear whether the performance enhancement from reusing those comparisons
will be an advantage when C sequence matching is merged.

> The attached patch lets it to extract both files in one go. It isn't a
> big win, but it makes diff -r before:x..x on bzr.dev in my repository
> about 0.1 seconds faster (~10% of the total diff time).

Doing this for a 0.1 second improvement doesn't seem worthwhile.  Can
you get numbers for a larger project, where the impact is likely to be
bigger?

I have concerns that larger projects will go through file revisions more
quickly, which would cause them to cross snapshots, which would reduce
the value of this improvement for them.

> === modified file 'bzrlib/diff.py'
> --- bzrlib/diff.py	2007-07-09 07:38:03 +0000
> +++ bzrlib/diff.py	2007-08-01 08:04:20 +0000
> @@ -356,13 +356,33 @@
>      if old_revision_spec is None:
>          old_tree = tree.basis_tree()
>      else:
> -        old_tree = spec_tree(old_revision_spec)
> +        old_tree = None

^^^ this is to ensure that if two trees are retrieved from the same
repo, their repository member has the same identity, right?  I think
that's an unnecessary constraint; the true requirement is that one of
the repositories has both revisions of the text.

>      if (new_revision_spec is None
>          or new_revision_spec.spec is None):
>          new_tree = tree
>      else:
> -        new_tree = spec_tree(new_revision_spec)
> +        if old_revision_spec is not None:
> +            if tree:
> +                old_rev = old_revision_spec.in_store(tree.branch)
> +                new_rev = new_revision_spec.in_store(tree.branch)
> +            else:
> +                old_rev = old_revision_spec.in_store(None)
> +                new_rev = new_revision_spec.in_store(None)

^^^ I suspect this in_store(None) business is untested.  Are there
invocations of bzr diff that can produce it?

> === modified file 'bzrlib/inventory.py'
> --- bzrlib/inventory.py	2007-07-26 21:18:35 +0000
> +++ bzrlib/inventory.py	2007-08-01 08:04:20 +0000
> @@ -649,11 +649,14 @@
>               output_to, reverse=False):
>          """See InventoryEntry._diff."""
>          try:
> -            from_text = tree.get_file(self.file_id).readlines()
> -            if to_entry:
> -                to_text = to_tree.get_file(to_entry.file_id).readlines()
> +            if to_entry and self.file_id == to_entry.file_id:
> +                from_text, to_text = to_tree.get_file_line_list(tree, self.file_id)
>              else:

It looks like this should be an assertion.  I don't think this interface
is meant to compare arbitrary files.

> === modified file 'bzrlib/revisiontree.py'
> --- bzrlib/revisiontree.py	2007-07-17 20:04:13 +0000
> +++ bzrlib/revisiontree.py	2007-08-01 08:04:20 +0000
> @@ -22,6 +22,8 @@
>      osutils,
>      revision,
>      symbol_versioning,
> +    tree,
> +    workingtree,
>      )
>  from bzrlib.tree import Tree
>  
> @@ -77,6 +79,13 @@
>          weave = self._get_weave(file_id)
>          return weave.get_lines(ie.revision)
>  
> +    def _get_file_line_list(self, from_tree, file_id):
> +        file_id = osutils.safe_file_id(file_id)
> +        ie1 = from_tree._inventory[file_id]
> +        ie2 = self._inventory[file_id]
> +        weave = self._get_weave(file_id)
> +        return weave.get_line_list((ie1.revision, ie2.revision))

^^^ This implementation should be on InterRevisionTree.  It should
include a check that the weave includes both revisions, and should fall
back to using the from_tree weave if not.

> +class InterRevisionTree(tree.InterTree):
> +
> +    @staticmethod
> +    def make_source_parent_tree(source, target):
> +        """Change the source tree into a parent of the target."""
> +        revid = source.commit('record tree')
> +        target.branch.repository.fetch(source.branch.repository, revid)
> +        target.set_parent_ids([revid])
> +        return target.basis_tree(), target
> +
> +    _matching_from_tree_format = workingtree.WorkingTreeFormat3()
> +    _matching_to_tree_format = workingtree.WorkingTreeFormat3()
> +    _test_mutable_trees_to_test_trees = make_source_parent_tree
> +
> +    @staticmethod
> +    def is_compatible(source, target):
> +        if (not isinstance(source, RevisionTree) or
> +            not isinstance(target, RevisionTree)):
> +            return False

^^^ This condition excludes DirstateRevisionTree, so it is too strict.

> +        if source._repository != target._repository:
> +            return False

^^^ This condition is too strict.  They are still revision trees if they
do not have a common repo, and it may still be possible to optimize the
comparison.

> === modified file 'bzrlib/tests/test_revisiontree.py'
> --- bzrlib/tests/test_revisiontree.py	2006-09-30 00:25:31 +0000
> +++ bzrlib/tests/test_revisiontree.py	2007-08-01 08:04:20 +0000
> @@ -59,3 +59,24 @@
>          null_tree = self.t.branch.repository.revision_tree(
>              revision.NULL_REVISION)
>          self.assertIs(None, null_tree.inventory.root)
> +
> +    def test__get_file_line_list(self):

^^^ This should be testing Tree.get_file_line_list, and it should be an
intertree_implementation test.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGugG10F+nu1YWqI0RAub2AJ9jUPySz3LZVITAujRgRvtANS/n6ACZAWsf
xgHD7SL2OoD/ARr91aTGMeE=
=jMiu
-----END PGP SIGNATURE-----



More information about the bazaar mailing list