[MERGE] Implement guess-renames

Mon Mar 23 16:44:09 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Aaron Bentley wrote:
>> This patch implements a guess-renames command.  Assume some other
>> process has renamed some of your files.  This will guess what renames
>> were performed and update bzr accordingly.

> This seems like it would fit well as a plugin, though if you have reason
> to make this a core command, I certainly wouldn't refuse. (The only
> argument against is the standard "proliferation of commands" one.)

I think that the need for this is pretty common, so we should provide an
answer in core.  It also makes it more convenient for plugins.  Anyhow,
people seem eager to suck bzrtools into core, so if I stuck it there, it
wouldn't stay long.

> +            for num, (file_id, contents) in enumerate(
> +                tree.iter_files_bytes(desired_files)):
> +                task.update('Calculating hashes', num, len(file_ids))
> +                s = StringIO()
> +                s.writelines(contents)
> +                s.seek(0)
> +                self.add_edge_hashes(s.readlines(), file_id)
> 
> ^- I think you can use "osutils.chunks_to_lines()" to do this a lot more
> efficiently than writing into a StringIO and the splitting. (Assuming
> that contents is 'mostly lines' and you want to make sure that it is
> 'exactly lines'.)

Okay.

> +                task.update('Determining hash hits', num, len(paths))
> +                my_file = self.tree.get_file(None, path=path)
> +                try:
> +                    hits = self.hitcounts(my_file.readlines())
> +                finally:
> +                    my_file.close()
> 
> It seems more reasonable to use: self.tree.get_file_lines()
> or if that isn't officially a Tree api, then doing:
> 
> my_file = self.tree.get_file(...)
> try:
>   my_lines = my_file.readlines()
> finally:
>   my_file.close()
> hits = self.hitcounts(my_lines)
> 
> Mostly because it seems silly to keep the file open for the whole
> duration of "hitcounts()" rather than just long enough to read its content.

I don't understand why that seems silly.  It's not a long duration and
we're not holding a lot of files open at once.

However, I didn't realize that get_file_lines supports a path.  Since it
does, I'll use that.

> 
> ...
> 
> +    @staticmethod
> +    def _match_hits(hit_list):
> +        """Using a hit list, determin a path-to-fileid map.
> ^- 'determine'

Thanks.

> +        seen_file_ids = set()
> +        seen_paths = set()
> +        path_map = {}

> +        return path_map
> 
> 
> ^- You don't need to keep a "seen_paths" set, as you can just say:
> 
> if path in path_map or file_id in seen_file_ids:

Thanks, I must have refactored that into stupidity.

> +    def guess_renames(klass, tree):
> +        """Guess which files to rename, and perform the rename.
> +
> +        We assume that unversioned files and missing files indicate that
> +        versioned files have been renamed outside of Bazaar.
> +        """
> +        required_parents = {}
> +        task = ui_factory.nested_progress_bar()
> +        try:
> +            pp = progress.ProgressPhase('Guessing renames', 4, task)
> +            basis = tree.basis_tree()
> +            basis.lock_read()
> +            try:
> 
> ^- You don't lock "tree" in this function, but you do lock
> tree.basis_tree(). Are you assuming that "tree" is already locked? If
> so, you should probably add that statement in the docstring. (You need a
> ":param tree:" anyway.

Done.

> 
> ...
> 
> +    def _update_tree(self, required_parents, matches):
> +        self.tree.add(required_parents)
> +        reversed = dict((v, k) for k, v in matches.iteritems())
> +        child_to_parent = sorted(
> +            matches.values(), key=lambda x: reversed[x], reverse=True)
> +        self.tree.unversion(child_to_parent)
> +        paths_forward = sorted(matches.keys())
> +        file_ids_forward = [matches[p] for p in paths_forward]
> +        self.tree.add(paths_forward, file_ids_forward)
> 
> ^- Doing this via 'unversion' followed by 'add' doesn't seem quite right
> versus doing it via "rename" operations. (As a possible problem, imagine
> you have a directory where some children are known, and some are not. If
> you then unversion() the directory, I think it would remove all known
> children, and you would only be adding back the previously unknown
> children.)

Because the directory is unknown, all of its children are unknown.
There would be some benefit to keeping the previous child list for
handling zero-length files, but all of the other children will have been
handled by the main code.

> I don't have a correctness understanding versus the rest of your
> changes, just a question about doing it this way.

There aren't a lot of great APIs for this.  I need to handle the case
where some parents have existing file-ids and some don't.  I could go
for apply_inventory_delta, but I'll have to look up a bunch of
parent_ids from multiple sources.

> ^- you have an extra <space> between from and bzrlib. And you are
> missing a vertical space before class.

Fixed.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknHvFgACgkQ0F+nu1YWqI0i9ACdH4v2o/OU0UaI0HDRhDUNlvZz
ajMAnAyFi96JYl8WifV4OVnHzzkdur9S
=7WoR
-----END PGP SIGNATURE-----