[MERGE] Nested trees: CompositeTree

Tue Apr 14 14:55:52 BST 2009

Aaron Bentley wrote:

> To an extent, this code is a work-in-progress, but it does pass all
> tests, even with the later threads merged in, so I think it's ready for
> review.

There's a lot here to digest and it taking me time. Here's a
work-in-progress review.

Some opening comments .... I don't currently have an opinion on
whether this overall approach is the right one or not. Furthermore,
I doubt I'll know that in weeks, let alone days. But I'm happy to
see us progressing a considered solution over no solution for this
important missing feature. We'll undoubtedly learn a lot from having
something working and for complex problems like this, evolutionary
design & implementation may well give us a better answer than any
we can imagine now. Let's be sure though to clearly articulate our
choices and minimise public interfaces when we can.

Also, code quality *feels* a little below the standard I usually
see in your patches. To be more specific, the code needs more
explanatory comments in exception cases and docstrings need
some attention. In particular, I'd like to see "path" parameters
clearly documented w.r.t. absolute vs relative and when relative,
whether they are relative to the top of the base tree or relative
to the subtree. It might well be obvious when you're knee deep in
the code but it's not obvious to a casual reviewer/maintainer.
Don't get me wrong - the code is good. But *you* usually send in
great and this isn't there yet.

I still need to review CompsiteInventory, some methods in
CompositeTree and the tests. I think there's enough changes
suggested/requested below for me to vote resubmit and not just
comment though.

bb:resubmit

> +    def __init__(self, root_tree, root_branch):
> +        self.root_tree = root_tree
> +        self.root_branch = root_branch
> +        self._clear_cached_trees()
> +        self.lock_count = 0
> +        self.lock_type = None
> +
> +    def _clear_cached_trees(self):
> +        self._subtrees_by_relpath = {}
> +        self._subtrees_by_compositepath = {}
> +        self._register_tree_compositepath(self.root_tree, '')
> +        self._all_trees_scanned = False
> +        self.id_tree_branch = {}
> +        self._locked_trees = []

Is there a good reason for lock_count & lock_type to be set/reset in
one place while _locked_trees is set/reset inside _clear_cached_trees()?
I'd probably feel more comfortable if clear_cached_trees() didn't
touch the locking-related attributes, but resetting all of them would
be OK as well.

> +    def paths_info(self, paths):
> +        """Return path, relpath, inventory entry, and tree for input paths
> +
> +        :return: a list of (path, relpath, entry, tree) tuples
> +        """
> +        self._must_be_locked()
> +        return [self._path_info(p) for p in paths]
> +
> +    def _path_info(self, path):
> +        """Find the relpath, tree and inventory entry for a path."""

These are examples of docstrings which should be clearer IMO. Which paths
are relative and relative to what exactly? entry can actually be returned
as None but when that happens isn't documented and ought to be. The ordering
implied by the second docstring is wrong as well.

> +            if node is None:
> +                return path, osutils.pathjoin(*relpath), None, tree
> +            try:
> +                kind = tree.kind(node.file_id)
> +            except errors.NoSuchFile:
> +                pass

I'm probably missing something obvious here but I don't see how
NoSuchFile can be raised here. If it is, why is it OK to ignore it?

> +    def get_path_tree(self, path):
> +        """Determine the tree containing a path and the tree's pathname."""

Some of the method names feel overly short. In particular, get_tree_path()
and get_path_tree() could be confused and don't return simple values like I
first expected they would on seeing their names. I don't have great
alternatives but as a starting point, maybe something like:

* get_tree_path() -> get_tree_and_treepath()
* get_path_tree() -> get_pathinfo_for_tree_containing()

> +            if path == '':
> +                raise AssertionError('Subpath cannot be empty.')

You want subpath in this test, not path.

> +    def iter_paths_by_tree(self, paths):
> +        tree_path = None

Needs a docstring.

> +    def _scan_subtrees(self):
> +        self._must_be_locked()
> +        def do_scan(prefix, relpath, tree, containing_tree, branch):
> +            self._register_both(tree, prefix, containing_tree, relpath)
> +            for path, file_id in tree.iter_references():
> +                tree_path = tree.id2path(file_id)

Can tree ever be None in these last 2 lines?

> +                do_scan(composite_path, tree_path, subtree, tree, subbranch)
> +        do_scan('', '', self.root_tree,  None, self.root_branch)

One too many spaces before None.

> +    def _register_both(self, tree, compositepath, containing_tree, relpath):
> +        self._register_tree_compositepath(tree, compositepath)
> +        # special case root
> +        if tree is not None:
> +            self._register_tree_relpath(tree, containing_tree, relpath)
> +

Why is it OK to register_tree_compositepath here if tree is None?

> +        subbranch = branch.reference_parent(file_id, path)
> +        try:
> +            subtree = self._get_tree_by_relpath(tree, path)
> +        except KeyError:
> +            subtree = tree.get_nested_tree(file_id, branch)

There's a lot of code (snipped) under that KeyError and I don't understand
what's it doing or why it's there. Please add some comments explaining that.

> +class CompositeInventory(object):
> +    """An inventory that combines the inventories of several subtrees.
> +
> +    Intended for use as CompositeTree.inventory, not for direct instantiation.
> +    """

This class isn't tested directly and ought to be private. Otherwise, I'm yet
to review it.

> +class CompositeTree(object):
> +    """Implements parts of Tree interface in terms of subtrees.
> +
> +    Essentially, this is a by-value representation of a set of nested
> +    subtrees.
> +    Care should be taken in its use-- not suitable for every problem.
> +    """

Can you elaborate here as to which problems it is suited for vs the
ones where it is not? (A link to an email thread or IRC chat would be
better than nothing.)

> +    def get_file_text(self, file_id, path=None):
> +        """See Tree.get_file_text."""
> +        tree, relpath = self._find_tree_path(file_id, path)
> +        return tree.get_file_text(file_id, relpath)
> +

WorkingTree.get_file_text() now has a filtered=True optional parameter. I
suspect you need it as well for content filtering to work as expected
on nested trees.

> +    def path2id(self, path):
> +        """See Tree.path2id."""
> +        # FIXME -- shouldn't return missing files
> +        return self.inventory.path2id(path)

Is this FIXME still relevant? I couldn't see why it was there.

> +    def conflicts(self):
> +        """See WorkingTree.conflicts."""
> +        # FIXME adjust paths in conflicts
> +        conflicts = []
> +        for prefix, tree in self._nested_trees.all_trees().iteritems():
> +            conflicts.extend(tree.conflicts())
> +        return _mod_conflicts.ConflictList(conflicts)

It looks like this FIXME is still required. It might be worth adding
a test for this.

> +    def kind(self, file_id):
> +        """See Tree.kind"""
> +        # Doing a lookup on file_id twice? Hmm.
> +        return self._nested_trees.get_tree_path(file_id)[0].kind(file_id)

I think get_tree() is better than get_tree_path()[0] here.
The comment can go as well?

> +    def filter_unversioned_files(self, paths):
> +        """See Tree.filter_unversioned_files."""
> +        unversioned = set()
> +        # Reverse sort should guarantee that all subtree paths are processed
> +        # before tree paths.  So if path.startswith(tree_path), we're in
> +        # the same tree.
> +        for tree, tree_path, subpaths in \
> +            self._nested_trees.iter_paths_by_tree(paths):

That comment ought to be moved into the iter_paths_by_tree() method
I suspect.

> +    def _comparison_data(self, entry, path):
> +        tree, prefix, path = self._nested_trees.get_path_tree(path)
> +        kind, executable, stat = tree._comparison_data(entry, path)
> +        if kind == 'tree-reference':
> +            kind = 'directory'

When can that last if test succeed? (I couldn't think of a case.)

> +    def get_file_sha1(self, file_id, path=None, stat_value=None):
> +        """See Tree.get_file_sha1."""
> +        return self._nested_trees.get_tree(file_id).get_file_sha1(
> +            file_id, None, stat_value)
> +
> +    def get_file_mtime(self, file_id, path=None):
> +        """See Tree.get_file_mtime."""
> +        return self._nested_trees.get_tree(file_id).get_file_mtime(
> +            file_id, None)

John has been promoting a policy of explicitly naming optional parameters
rather than relying on their position being unchanged. I think you should
follow that convention in these methods.

> +    def id2path(self, file_id):
> +        """See Tree.id2path."""
> +        return self._nested_trees.get_tree_path(file_id)[1]

This is wrong isn't it? get_tree_path()[1] is the path of the tree
when we want to return the path of the file-id, perhaps within a tree, here.

> +    def apply_inventory_delta(self, changes):
> +        """Apply an inventory delta to a set of subtrees.

Do we want/need to use lock decorators on methods like this?
If not, we probably ought to explain why in the class docstring.

> +                if new_tree_path != '':
> +                    new_path = new_path[len(new_tree_path)+1:]
> +                new_root_id = new_tree.get_root_id()
> +            else:
> +                new_tree_path = None
> +            if old_path is not None:
> +                old_tree, old_tree_path = \
> +                    self._nested_trees.get_tree_path(file_id)
> +                if old_tree_path != '':
> +                    old_path = old_path[len(old_tree_path) + 1:]

Very minor grumble - "+1:]" vs " + 1:]". I'm OK either way but let's
be consistent.

> +        for delta, tree in deltas.itervalues():
> +            tree.apply_inventory_delta(delta)

While a little tricky, I couldn't see anything wrong with the
apply_inventory_delta() code. But it does scare me a bit as to
what edge cases it might not be handling. For example, I wonder
whether a kind change from directory to tree-reference or vice-versa
would be handled correctly, where the pre and post set of
tree-references changes. Maybe we just need to be extra diligent
in ensuring we have enough unit tests for stuff like that?

A start,
Ian C.