[MERGE] Implement guess-renames

John Arbash Meinel john at arbash-meinel.com
Mon Mar 23 16:02:58 GMT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Hi all,
> 
> This patch implements a guess-renames command.  Assume some other
> process has renamed some of your files.  This will guess what renames
> were performed and update bzr accordingly.
> 

This seems like it would fit well as a plugin, though if you have reason
to make this a core command, I certainly wouldn't refuse. (The only
argument against is the standard "proliferation of commands" one.)

+            for num, (file_id, contents) in enumerate(
+                tree.iter_files_bytes(desired_files)):
+                task.update('Calculating hashes', num, len(file_ids))
+                s = StringIO()
+                s.writelines(contents)
+                s.seek(0)
+                self.add_edge_hashes(s.readlines(), file_id)

^- I think you can use "osutils.chunks_to_lines()" to do this a lot more
efficiently than writing into a StringIO and the splitting. (Assuming
that contents is 'mostly lines' and you want to make sure that it is
'exactly lines'.)

...

+                task.update('Determining hash hits', num, len(paths))
+                my_file = self.tree.get_file(None, path=path)
+                try:
+                    hits = self.hitcounts(my_file.readlines())
+                finally:
+                    my_file.close()

It seems more reasonable to use: self.tree.get_file_lines()
or if that isn't officially a Tree api, then doing:

my_file = self.tree.get_file(...)
try:
  my_lines = my_file.readlines()
finally:
  my_file.close()
hits = self.hitcounts(my_lines)

Mostly because it seems silly to keep the file open for the whole
duration of "hitcounts()" rather than just long enough to read its content.

...

+    @staticmethod
+    def _match_hits(hit_list):
+        """Using a hit list, determin a path-to-fileid map.
^- 'determine'

+
+        The hit list is a list of (count, path, file_id), where count is a
+        (possibly float) number, with higher numbers indicating stronger
+        matches.
+        """
+        seen_file_ids = set()
+        seen_paths = set()
+        path_map = {}
+        for count, path, file_id in sorted(hit_list, reverse=True):
+            if path in seen_paths or file_id in seen_file_ids:
+                continue
+            path_map[path] = file_id
+            seen_paths.add(path)
+            seen_file_ids.add(file_id)
+        return path_map


^- You don't need to keep a "seen_paths" set, as you can just say:

if path in path_map or file_id in seen_file_ids:

Because a dict lookup is identical to a set lookup, and every path that
you see gets put into the path_map. (This should save a reasonable
amount of memory.)

...

+    def guess_renames(klass, tree):
+        """Guess which files to rename, and perform the rename.
+
+        We assume that unversioned files and missing files indicate that
+        versioned files have been renamed outside of Bazaar.
+        """
+        required_parents = {}
+        task = ui_factory.nested_progress_bar()
+        try:
+            pp = progress.ProgressPhase('Guessing renames', 4, task)
+            basis = tree.basis_tree()
+            basis.lock_read()
+            try:

^- You don't lock "tree" in this function, but you do lock
tree.basis_tree(). Are you assuming that "tree" is already locked? If
so, you should probably add that statement in the docstring. (You need a
":param tree:" anyway.

...

+    def _update_tree(self, required_parents, matches):
+        self.tree.add(required_parents)
+        reversed = dict((v, k) for k, v in matches.iteritems())
+        child_to_parent = sorted(
+            matches.values(), key=lambda x: reversed[x], reverse=True)
+        self.tree.unversion(child_to_parent)
+        paths_forward = sorted(matches.keys())
+        file_ids_forward = [matches[p] for p in paths_forward]
+        self.tree.add(paths_forward, file_ids_forward)

^- Doing this via 'unversion' followed by 'add' doesn't seem quite right
versus doing it via "rename" operations. (As a possible problem, imagine
you have a directory where some children are known, and some are not. If
you then unversion() the directory, I think it would remove all known
children, and you would only be adding back the previously unknown
children.)

I don't have a correctness understanding versus the rest of your
changes, just a question about doing it this way.


=== added file 'bzrlib/tests/blackbox/test_guess_renames.py'
- --- bzrlib/tests/blackbox/test_guess_renames.py	1970-01-01 00:00:00 +0000
+++ bzrlib/tests/blackbox/test_guess_renames.py	2009-03-12 06:39:43 +0000
@@ -0,0 +1,22 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+from  bzrlib import tests
+
+class TestGuessRenames(tests.TestCaseWithTransport):
+

^- you have an extra <space> between from and bzrlib. And you are
missing a vertical space before class.



Overall, I think the functionality and algorithms are good. I'm not sure
if it should be a core command or not...

 BB:tweak

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknHsrIACgkQJdeBCYSNAAOgYACg0CTklrAJwD3h2S/R/byeT8dg
5g8AnAtdjADoT2uWKKv92wCLu33W47+K
=zyvl
-----END PGP SIGNATURE-----



More information about the bazaar mailing list