[MERGE] TreeTransform avoids many renames when constructing trees

Ian Clatworthy ian.clatworthy at internode.on.net
Mon Jun 4 13:33:29 BST 2007


Aaron Bentley wrote:

> This patch implements the top-rename strategy for TreeTransform, so that
> building a tree only renames files in the top-level directory.

Cool!! Initial feedback below. I'm yet to get across all of transform.py
and therefore can't be confident that you haven't missed somewhere that
ought to be changed. Even so, I hope the comments below are a useful
partial review ...

> === modified file NEWS
> --- NEWS
> +++ NEWS
> @@ -24,6 +24,10 @@
>      * Revert does not try to preserve file contents that were originally
>        produced by reverting to a historical revision.  (Aaron Bentley)
>  
> +    * TreeTransform avoids many renames when contructing large trees, improving
> +      speed.  3.25x speedups have been observed for construction of
> +      kernel-sized-trees, and checkouts are 1.28x faster.  (Aaron Bentley)
> +
>    BUGFIXES:
>  
>      * ``bzr push`` should only connect to the remote location one time.
> 

+1

> === modified file bzrlib/tests/test_transform.py // last-changed:aaron.bentley@
> ... utoronto.ca-20070603155659-rimfmfeyw7zgbgzm
> --- bzrlib/tests/test_transform.py
> +++ bzrlib/tests/test_transform.py
> @@ -394,7 +394,6 @@
>          self.assertEqual(os.readlink(self.wt.abspath('oz/wizard')),
>                           'wizard-target')
>  
> -
>      def get_conflicted(self):
>          create,root = self.get_transform()
>          create.new_file('dorothy', root, 'dorothy', 'dorothy-id')
> @@ -545,9 +544,11 @@
>          wt = transform._tree
>          transform.new_file('set_on_creation', root, 'Set on creation', 'soc',
>                             True)
> -        sac = transform.new_file('set_after_creation', root, 'Set after creation', 'sac')
> +        sac = transform.new_file('set_after_creation', root,
> +                                 'Set after creation', 'sac')
>          transform.set_executability(True, sac)
> -        uws = transform.new_file('unset_without_set', root, 'Unset badly', 'uws')
> +        uws = transform.new_file('unset_without_set', root, 'Unset badly',
> +                                 'uws')
>          self.assertRaises(KeyError, transform.set_executability, None, uws)
>          transform.apply()
>          self.assertTrue(wt.is_executable('soc'))

+1

> @@ -774,6 +775,122 @@
>          finally:
>              transform.finalize()
>  
> +    def test_rename_count(self):
> +        transform, root = self.get_transform()
[snip]

I'm yet to review these tests. All are new tests so +1 on that, though
someone else (or me later) ought to review them in detail.

> @@ -1115,6 +1232,23 @@
>          self.assertRaises(errors.WorkingTreeAlreadyPopulated, 
>              build_tree, source.basis_tree(), target)
>  
> +    def test_build_tree_rename_count(self):
> +        source = self.make_branch_and_tree('source')
> +        self.build_tree(['source/file1', 'source/dir1/'])
> +        source.add(['file1', 'dir1'])
> +        source.commit('add1')
> +        target1 = self.make_branch_and_tree('target1')
> +        tt = build_tree(source.basis_tree(), target1)
> +        self.assertEqual(2, tt.rename_count)
> +
> +        self.build_tree(['source/dir1/file2'])
> +        source.add(['dir1/file2'])
> +        source.commit('add3')
> +        target2 = self.make_branch_and_tree('target2')
> +        tt = build_tree(source.basis_tree(), target2)
> +        # children of non-root directories should not be renamed
> +        self.assertEqual(2, tt.rename_count)
> +
>  
>  class MockTransform(object):

build-tree() returns a _TransformResults object so I'd suggest 'tr' as
the result variable name instead of 'tt'.

> @@ -1127,6 +1261,7 @@
>                  return True
>          return False
>  
> +
>  class MockEntry(object):
>      def __init__(self):
>          object.__init__(self)
> 
+1

> === modified file bzrlib/transform.py // last-changed:aaron.bentley at utoronto.ca
> ... -20070603163308-r6el1wy9cvj9h6h2
> --- bzrlib/transform.py
> +++ bzrlib/transform.py
> @@ -51,16 +51,21 @@
>  
>  
>  class _TransformResults(object):
> -    def __init__(self, modified_paths):
> +    def __init__(self, modified_paths, rename_count):
>          object.__init__(self)
>          self.modified_paths = modified_paths
> +        self.rename_count = rename_count
>  
>  
>  class TreeTransform(object):
>      """Represent a tree transformation.
>      
>      This object is designed to support incremental generation of the transform,
> -    in any order.  
> +    in any order.
> +
> +    However, it gives optimum performance when parent directories are created
> +    before their contents.  The transform is then able to put child files
> +    directly in their parent directory, avoiding later renames.
>      
>      It is easy to produce malformed transforms, but they are generally
>      harmless.  Attempting to apply a malformed transform will cause an
> @@ -106,6 +111,15 @@
>          self._new_name = {}
>          self._new_parent = {}
>          self._new_contents = {}
> +        # A mapping of transform ids to their limbo filename
> +        self._limbo_files = {}
> +        # A mapping of transform ids to a set of the transform ids of children
> +        # their limbo directory has

s/their/that/

> +        self._limbo_children = {}
> +        # Map transform ids to maps of child filename to child transform id
> +        self._limbo_children_names = {}
> +        # List of transform ids that need to be renamed from limbo into place
> +        self._needs_rename = set()
>          self._removed_contents = set()
>          self._new_executability = {}
>          self._new_reference_revision = {}
> @@ -115,13 +129,14 @@
>          self._removed_id = set()
>          self._tree_path_ids = {}
>          self._tree_id_paths = {}
> +        # Cache of realpath results, to speed up canonical_path
>          self._realpaths = {}
> -        # Cache of realpath results, to speed up canonical_path
> +        # Cache of relpath results, to speed up canonical_path
>          self._relpaths = {}
> -        # Cache of relpath results, to speed up canonical_path
>          self._new_root = self.trans_id_tree_file_id(tree.get_root_id())
>          self.__done = False
>          self._pb = pb
> +        self.rename_count = 0
>  
>      def __get_root(self):
>          return self._new_root
> @@ -165,8 +180,28 @@
>          """Change the path that is assigned to a transaction id."""
>          if trans_id == self._new_root:
>              raise CantMoveRoot
> +        previous_parent = self._new_parent.get(trans_id)
> +        previous_name = self._new_name.get(trans_id)
>          self._new_name[trans_id] = name
>          self._new_parent[trans_id] = parent
> +        if (trans_id in self._limbo_files and
> +            trans_id not in self._needs_rename):
> +            self._rename_in_limbo([trans_id])
> +            self._limbo_children[previous_parent].remove(trans_id)
> +            del self._limbo_children_names[previous_parent][previous_name]
> +
> +    def _rename_in_limbo(self, trans_ids):
> +        """Fix a limbo name so that the right final path is produced.
> +
> +        This means we outsmarted ourselves-- we tried to avoid renaming
> +        these file later by creating them with their final names in their

s/file/files/

> +        final parent.  But now the previous name or parent is no longer
> +        suitable, so we have to rename them.
> +        """
> +        for trans_id in trans_ids:
> +            old_path = self._limbo_files[trans_id]
> +            new_path = self._limbo_name(trans_id, from_scratch=True)
> +            os.rename(old_path, new_path)
>  
>      def adjust_root_path(self, name, parent):
>          """Emulate moving the root by moving all children, instead.
> @@ -330,6 +365,13 @@
>      def cancel_creation(self, trans_id):
>          """Cancel the creation of new file contents."""
>          del self._new_contents[trans_id]
> +        children = self._limbo_children.get(trans_id)
> +        # if this is a limbo directory with children, move them before removing
> +        # the directory
> +        if children is not None:
> +            self._rename_in_limbo(children)
> +            del self._limbo_children[trans_id]
> +            del self._limbo_children_names[trans_id]
>          delete_any(self._limbo_name(trans_id))
>  
>      def delete_contents(self, trans_id):
> @@ -751,11 +793,36 @@
>          self._tree.apply_inventory_delta(inventory_delta)
>          self.__done = True
>          self.finalize()
> -        return _TransformResults(modified_paths)
> +        return _TransformResults(modified_paths, self.rename_count)
>

+1 for above bar spelling issues notes

> -    def _limbo_name(self, trans_id):
> +    def _limbo_name(self, trans_id, from_scratch=False):
>          """Generate the limbo name of a file"""
> -        return pathjoin(self._limbodir, trans_id)
> +        if trans_id in self._limbo_files and not from_scratch:
> +            return self._limbo_files[trans_id]
> +        parent = self._new_parent.get(trans_id)
> +        # if the parent directory is already in limbo (e.g. when building a
> +        # tree), choose a limbo name inside the parent, to reduce further
> +        # renames.
> +        use_direct_path = False
> +        if self._new_contents.get(parent) == 'directory':
> +            filename = self._new_name.get(trans_id)
> +            if filename is not None:
> +                if parent not in self._limbo_children:
> +                    self._limbo_children[parent] = set()
> +                    self._limbo_children_names[parent] = {}
> +                    use_direct_path = True
> +                elif (self._limbo_children_names[parent].get(filename)
> +                      in (trans_id, None)):
> +                    use_direct_path = True

This routine is all good to me bar this elif branch. I kind of expected
the 'None' case being the default and guessed that the 'trans_id' case
might happen if the filename was already there??? And being in that
dictionary with a different value is possible as well given you're
explicitly guarding against that?

That's a long way of saying that a comment would help here IMO.

> +        if use_direct_path:
> +            limbo_name = pathjoin(self._limbo_files[parent], filename)
> +            self._limbo_children[parent].add(trans_id)
> +            self._limbo_children_names[parent][filename] = trans_id
> +        else:
> +            limbo_name = pathjoin(self._limbodir, trans_id)
> +            self._needs_rename.add(trans_id)
> +        self._limbo_files[trans_id] = limbo_name
> +        return limbo_name
>  
>      def _apply_removals(self, inv, inventory_delta):
>          """Perform tree operations that remove directory/inventory names.
> @@ -781,6 +848,8 @@
>                      except OSError, e:
>                          if e.errno != errno.ENOENT:
>                              raise
> +                    else:
> +                        self.rename_count += 1
>                  if trans_id in self._removed_id:
>                      if trans_id == self._new_root:
>                          file_id = self._tree.inventory.root.file_id
> @@ -812,12 +881,15 @@
>                  if trans_id in self._new_contents or \
>                      self.path_changed(trans_id):
>                      full_path = self._tree.abspath(path)
> -                    try:
> -                        os.rename(self._limbo_name(trans_id), full_path)
> -                    except OSError, e:
> -                        # We may be renaming a dangling inventory id
> -                        if e.errno != errno.ENOENT:
> -                            raise
> +                    if trans_id in self._needs_rename:
> +                        try:
> +                            os.rename(self._limbo_name(trans_id), full_path)
> +                        except OSError, e:
> +                            # We may be renaming a dangling inventory id
> +                            if e.errno != errno.ENOENT:
> +                                raise
> +                        else:
> +                            self.rename_count += 1
>                      if trans_id in self._new_contents:
>                          modified_paths.append(full_path)
>                          del self._new_contents[trans_id]

+1

> @@ -1219,10 +1291,11 @@
>              wt.add_conflicts(conflicts)
>          except errors.UnsupportedOperation:
>              pass
> -        tt.apply()
> +        result = tt.apply()
>      finally:
>          tt.finalize()
>          top_pb.finished()
> +    return result
>  
>

This looks like a potential bug to me. result is only set as the last
thing in a long try/finally block. I think it needs to be initialised to
a benign value (None?) prior to the start of the try/finally block. I'd
also like to see the docstring for build_tree() improved to explicitly
explain that a _TransformResult object is returned.

While not directly related to your patch (and therefore not necessary to
land it), here are some other thoughts on some bits of transform.py I've
read over today:

1. finalize(). Putting try/except around os.rmdir() and os.unlink()
   feels more defensive to me than assuming, however good the assumption
   is, that they can't fail. If one rmdir/unlink does fails, do we
   really want to implicitly abort? Would continuing on deleting as many
   other files in limbo as possible be better behaviour?
   With nested files now, will children always get nuked before parents?

2. apply docstring(). My initial thought was that the call to finalize()
   was redundant because it is always done in finally clauses around
   calls to it. But even if true inside transform.py, tests seem
   to assume that apply() calls finalize(), releasing the write lock.
   I'd suggest making that explicit in the apply() docstring in that
   case. __init__ makes it clear that finalize() needs to be called
   so making it clear that apply() does this clarifies the caller
   contract IMO.

3. _build_tree large internal comment: spaces missing in kindof, partof
   and hereis.

4. revert()'s change_reporter parameter. This should be initialised to
   *False*, not *None*. Only the boolean value is tested in the code
   and only a boolean value is passed in from the one place I could
   find calling it with that parameter (workingtree.py).

5. _alter_files() trivial stuff. No need for if/else for setting
   skip_root - the value of the if test is fine. Also a comment
   spelling mistake: lazyily -> lazily.

Once again, apologies for only a partial review. (transform.py is 28
pages and I'd need to be across more of that code than I had time today
to digest before I could be fully confident in the patch.) Having said
that, except for reviewing a batch of new tests, I'm happy with all your
changes bar the minor niggles noted above. Of those, only the
initialization of result in _build_tree() is of risk consequence IMO.
And maybe finalize() being more defensive??

Ian C.



More information about the bazaar mailing list