[MERGE/RFC] case sensitivity on Windows

Mon Dec 22 06:05:30 GMT 2008

Hi Mark,

Thanks for working on this and apologies for taking so long to
getting to reviewing it. When I read through this patch, it really
pushes home to me how important some fundamental Bazaar design
choices - like treating directories as 1st class items - are w.r.t.
Doing The Right Thing for our end users.

There's a lot to like about this patch: design thought/care w.r.t.
performance, plenty of tests, technical documentation. I've only
requested some tweaks but I'm not 100% confidence that *I* know
enough to safely vote 'tweak' yet.

bb:comment

Mark Hammond wrote:
> On 20/12/2008 5:09 AM, John Arbash Meinel wrote:
> 
>> Which just as a start makes me wonder about the layering. Certainly
>> we can move case_sensitive up into MutableTree, and just have it
>> default to True (since Memory implementations would be case
>> sensitive, etc.)
> 
> Done.

When does this case_sensitive attribute ever get set to False?
(Sorry if that's a dumb question.)

>> So I'm thinking we should probably just promote
>> 'cicp_canonical_relpath' to be plain 'canonical_relpath'
>> and just try to be careful about when we use it.

> So: taking into account your other comments, here is what I've come up
> with:
> 
> * There is no WorkingTree.canonical_relpath() function - people who need
> that functionality call osutils directly.
> 
> * osutils sticks with the name 'canonical_relpath()' and adds
> 'canonical_relpaths()' - hopefully the fact they live in osutils implies
> we are talking about the file-system.
> 
> * Tree has methods 'get_canonical_inventory_paths' and
> 'get_canonical_inventory_paths()'.  The names are long, but I hope they
> are now unambiguous.

Seems good enough to me (though John may well offer improvements).

>> I guess the only reason not to do that is because you may be talking
>> about the path-on-disk rather than the path-in-the-tree, which can
>> certainly give different results, and why all of this is really hard to
>> get right, because you never now for a string whether it is the path on
>> disk, or the path in memory, etc.
> 
> Exactly - each consumer of tree_files() needs to be examined and the
> appropriate action for that command taken, possibly depending on other
> params, as we saw above.  As you note, in tree_files() you aren't sure
> which "canonical" version is desired, if any, but each consumer of
> tree_files() should only have one unambiguous requirement in this regard
> - hence the burden really does need to fall on each individual command.
>  I did try passing a param down through tree_files() for this purpose,
> but it got ugly quickly...

Hmm - that *really* is unfortunate because it implies that cicp support
will get added/debugged one command at a time. I wonder whether you
ought to change safe_relpath_files() to cover the most common case
(strings are inventory names) and add the special case handling into
the commands (only add and mv?) that behave otherwise?

I certainly know that in the case of adding Filtered Views support,
being able to tweak tree_files() (to look up the filtered view if one
is enabled) was one of the keys to making the feature viable w.r.t. both
initial coding *and* ongoing support. Not that my code is up for review
yet ... :-) :-)

If I'm flogging a dead horse here, apologies. (Please point me to the
list archives where this was flogged in my absence in that case.)

>> I bring it up, because we may want to *only* provide the multi-version.
> 
> Good point - I experimented with removing the singular version, but
> unfortunately that did cost some readability and loss of clarity - eg, I
> ended up with:
> 
> |tree.get_canonical_inventory_paths([osutils.dirname(rel_names[1])])[0]
> 
> which is quite a bit more complicated than:
> 
> |tree.get_canonical_inventory_path(osutils.dirname(rel_names[1]))
> 
> Even if the first form is split over multiple lines I wasn't
> particularly happy with the result.
> 
> The updated patch includes singular and multiple versions of both
> functions, with the implementation of caching in the multiple versions
> left as a future optimization.

I think it's worth having both versions.

Some explicit review tweaks ...

> +    * When working on a case-insensitive case-preserving file-systems, as
> +      commonly found with Windows, bzr will often ignore the case of the
> +      arguments specified by the user in preference to the case of an existing
> +      item on the file-system or in the inventory to help prevent
> +      counter-intuitive behaviour on Windows.

Be sure to take the credit by adding your name to this end of this NEWS item.

> +    def fix_case_of_inventory_path(self, path):
> +        """If our tree isn't case sensitive, return the canonical path"""
> +        if not self.case_sensitive:
> +            path = self.get_canonical_inventory_path(path)
> +        return path

I'd prefer this to be private. Among other reasons, all public APIs
(by project convention) need an explicit test case confirming that they can be
called. This prevents public APIs being accidentally removed.

> +        # than to convert an abspath to tree relative, and its cheaper to
> +        # perform the canonicalization in bulk.

s/its/it's/

> +    TODO: it should be possible to optimize this for Windows by using the
> +    win32 API FindFiles function to look for the specified name - but using
> +    os.listdir() still gives us the correct, platform agnostic semantics in
> +    the short term.

I think this comment ought to be out of the docstring.

> +    # use an explicit iterator so we can easily consume the rest on early exit.
> +    bit_iter = rel.split('/')
> +    for bit in bit_iter:
> +        lbit = bit.lower()

Comparing this code to _yield_canonical_inventory_paths in tree.py, is there a
reason why you did iter(path.split("/")) in the latter but haven't used iter()
around rel.split('/') above?

> +    def test_re_add(self):
> +        """Test than when a file has 'unintentionally' changed case, we can't
> +        add a new entry using the new case."""
> +        wt = self.make_branch_and_tree('.')
> +        # create a file on disk with the mixed-case name
> +        self.build_tree(['MixedCase'])
> +        self.check_output('added MixedCase\n', 'add MixedCase')
> +        # 'accidently' rename the file on disk
> +        os.rename('MixedCase', 'mixedcase')
> +        self.check_output('', 'add mixedcase')

I think it's worth checking stderr and the return code here - not just stdout.

> +    def test_re_add_dir(self):
> +        # like re-add, but tests when the operation is on a directory.
> +        """Test than when a file has 'unintentionally' changed case, we can't
> +        add a new entry using the new case."""
> +        wt = self.make_branch_and_tree('.')
> +        # create a file on disk with the mixed-case name
> +        self.build_tree(['MixedCaseParent/', 'MixedCaseParent/MixedCase'])
> +        self.check_output('added MixedCaseParent\nadded MixedCaseParent/MixedCase\n',
> +                          'add MixedCaseParent')
> +        # 'accidently' rename the directory on disk
> +        os.rename('MixedCaseParent', 'mixedcaseparent')
> +        self.check_output('', 'add mixedcaseparent')

Ditto.

> +    # The following commands need tests and/or cicp lovin':
> +    # update, remove, file_id, file_path, diff, log, touching_revisions, ls,
> +    # ignore, cat, revert.

'resolve' comes to mind as well.

> +    def test_canonical_path(self):
> +        work_tree = self.make_branch_and_tree('tree')
> +        self.build_tree(['tree/dir/', 'tree/dir/file'])
> +        work_tree.add(['dir', 'dir/file'])
> +        work_tree.commit('commit 1')
> +        self.assertEqual(work_tree.get_canonical_inventory_path('Dir/File'), 'dir/file')
> +
> +    def test_canonical_path_before_commit(self):
> +        work_tree = self.make_branch_and_tree('tree')
> +        self.build_tree(['tree/dir/', 'tree/dir/file'])
> +        work_tree.add(['dir', 'dir/file'])
> +        # note: not committed.
> +        self.assertEqual(work_tree.get_canonical_inventory_path('Dir/File'), 'dir/file')
> +

The first 3 lines of this bunch of test methods are identical so I'd
suggest refactoring them into a helper method.

> +    def get_canonical_inventory_paths(self, paths):
> +        """Returns a list with each item the first path that
> +        case-insensitively matches the specified input paths.

Nice job on the docstrings throughout this patch. My only gripe is
that the top line needs to stand-alone as a short description and you're
often taking two lines for that. For test methods, I don't care. For
methods like this though, I think sticking to a single line is good
because it improves the quality of our technical documentation sites
like http://starship.python.net/crew/mwh/bzrlibapi/. (That's linked
from http://bazaar-vcs.org/Documentation BTW).

> +    def get_canonical_inventory_path(self, path):
> +        """A convenience version of get_canonical_inventory_path which
> +        takes a single path.

Two lines again. And s/_path/_paths/.

> +Case Insensitive File Systems
> +=============================
> +
> +Bazaar must be portable across operating-systems and file-systems.  While the
> +primary file-system for an operating-system might have some particular
> +characteristics, it's not necessary that *all* file-systems for that
> +operating-system will have the same characteristics.
> +

Nicely written doc. Please link it into doc/developers/index.txt under
Specifications.

> +There are two places this match can be performed against - the file-system
> +and the Bazaar inventory.  When looking at the file-system, it is generally
> +impossible to have 2 names that differ only by case, so there is no ambiguity.

Hmmm - "generally impossible" is assuming a CI filesystem. I appreciate that
this whole section is probably assuming that? I'd prefer we made it explicit
by saying ...

"When looking at a case-insensitive filesystem, it is impossible ..."

Ian C.