[patch, rfc] speed up selective-file diff

Fri Dec 22 14:59:50 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> These changes speed up `bzr diff FILE` with a specified file.  At the moment we
> walk the whole inventory of both trees, which is unnecessary.
> 
> On the mozilla tree that cuts diff of a single file from 3.4s to 2.6s in ad-hoc
> testing.
> 
> This passes all the tests but I would like to eliminate more of the redundancy
> before it's merged in.

I agree.  Specifically, I think it would be better to allow
iter_entries_by_dir() to take a list of file-ids.  That would allow us
to have only one implementation of the main loop.

> I think this code might be a bit happier returning objects rather than
> tuples if the cost is not too high.

The issue is that if there are a lot of changes, or if iter_changes is
invoked with include_unchanged, object construction time may be
significant.

> We can also make it cleaner by asking entries to compare themselves
> rather than hardcoding a switch on their type in here.

Your patch reminded me why I didn't do that.  _iter_changes compares
versioned files to unversioned files.  (This is quite useful in, e.g.
revert.)  Requiring an InventoryEntry makes that hard.

Also, WorkingTree doesn't update the InventoryEntry when the underlying
file kind is changed.  So even among versioned files, you can't reliably
use InventoryEntries to do comparison.

This is sort of a general problem with the WorkingTree inventory; it
tends to reflect user actions rather than the physical state of the
directory.  Meanwhile, actual WorkingTree methods are much more trustworthy.

> That would take
> us towards getting properties of wt files through their entries rather
> than through the tree.

I think that's the wrong direction.  With dirstate, we're trying to
avoid constructing inventories.  And using InventoryEntries requires us
to retrieve information that we may not need.  So in general, I think
tree methods are better.

> -def _compare_trees(old_tree, new_tree, want_unchanged, specific_file_ids,
> -                   include_root):
> +def _tree_delta_from_change_iter(change_iter, old_tree, include_root):
> +    """Construct a TreeDelta from _iter_changes results"""

I like this.  It might be nice as a public function, so that we can
reuse iter_changes output.

> +    def _iter_changes_specific_files(self, want_unchanged, specific_file_ids):
> +        """List changes between specific file ids"""
> +        # note that this is also called when we're given just the root of the
> +        # branch or tree, because it's expanded to the ids of all versioned
> +        # files
> +        from_tree = self.source
> +        to_tree = self.target
> +        for file_id in specific_file_ids:

You appear to be violating the ordering constraint that children are
emitted after parents.

>          Path is relative to the to_tree.  changed_content is True if the file's
>          content has changed.  This includes changes to its kind, and to
>          a symlink's target.
>  
> +        As a special case, if the file does not exist in the destination tree, then 
> +        the path from the source tree is given instead.

That's incorrect.  The path emitted is the path that the the file WOULD
have if it did exist.  So if the parent directory has been renamed and
the child has been deleted, the parent directory rename is taken into
account when generating the path.

> +        # TODO: perhaps generate objects rather than tuples, so that it's
> +        # easier to see what fields are being accessed, and so that some
> +        # expensive things like the path can be generated as needed.

The path isn't an additional expense; since we have an ordering
constraint that children follow parents, we must read all that data anyhow.

> +    def _compare_entry_content(self, file_id,

As I noted above, I think this is a bad idea.

> +        if from_entry is None and to_entry is None:
> +            # neither one is versioned - not sure what we should do about
> +            # this...

It seems like we can compare unversioned files, so if we're asked to, we
should.

> +        if from_entry.kind != to_entry.kind:
> +            return True

InventoryEntry.kind is not trustworthy.  It can easily be out of date.

> +        elif from_entry.kind == 'file':
> +            # TODO: should possible fix workingtree entries so that they can
> +            # answer these dynamically... mbp 20061221

We already have Tree.kind, which is reliable.  We also want to reduce
our direct use of inventories, so that we're not required to generate
inventories for all trees (e.g. dirstate).  So I don't see the point.
We should just use Trees.

I think the impedance mismatch here is from trying to use the same
structure for archived inventories and live working trees.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFi/Lm0F+nu1YWqI0RAujYAJ4xLCc/nDGfPv/uGNkWc1yjY0UBIACbBy/6
YuBcXr9KjIj4IEOia85GJKg=
=k2PE
-----END PGP SIGNATURE-----