[MERGE/RFC] case sensitivity on Windows

Fri Dec 19 18:09:55 GMT 2008

John Arbash Meinel has voted resubmit.
Status is now: Resubmit
Comment:
I think this is getting close to what we would want. I think one of the 
biggest bits that caused "resubmit" was that you have a few functions 
with similar names that don't quite do the same thing. Overall, I'm 
really happy to see that it isn't as invasive as I thought it would be.

I would make "cicp_canonical_relpath" a private function, just because 
it isn't something you want people to use directly. They should be using 
the "osutils.canonical_relpath" function, and marking it 
"_cicp_canonical_relpath" should help point people in that direction.

That said, we want to support CICP filesystem on other platforms, such 
as Mac and vfat on Linux, so I would guess we actually want something 
like "canonical_relpath" to be available everywhere, but only used when 
WT.case_sensitive is True.

Which just as a start makes me wonder about the layering. Certainly we 
can move case_sensitive up into MutableTree, and just have it default to 
True (since Memory implementations would be case sensitive, etc.)

So I'm thinking we should probably just promote 'cicp_canonical_relpath' 
to be plain 'canonical_relpath' and just try to be careful about when we 
use it.

So in discussing canonical_relpath, why do you need to call abspath() 
before you start iterating it? I would also guess that you could build 
up just the relative portion, rather than starting with base and 
building on top of it, and then trimming out the base as the last step 
before you return. Also, I notice you don't special case an exact-match. 
Specifically, you *always* iterate the results of listdir() (or 
tree.iter_children(root_id)), rather than making an attempt to see if 
there exists an entry with the exact name you are looking for. I have 
the feeling that most times the paths come in exactly as they are on 
disk (especially if someone takes advantage of TAB completion), and if 
there is a directory with a lot of files, it could save a significant 
amount of work.
These are minor tweaks/optimizations though, and I'll try to focus more 
on design issues.

...

=== modified file 'bzrlib/builtins.py'

--- bzrlib/builtins.py  2008-11-15 20:37:36 +0000
+++ bzrlib/builtins.py  2008-11-17 12:16:33 +0000
@@ -210,6 +210,8 @@
                  no_pending = True
          # A specific path within a tree was given.
          elif relfile_list is not None:
+            # convert the names to canonical versions as stored in the 
inventory
+            relfile_list = [tree.get_canonical_path(f) for f in 
relfile_list]
              no_pending = True
          show_tree_status(tree, show_ids=show_ids,
                           specific_files=relfile_list, 
revision=revision,

^- For example, this might become:

elif relfile_list is not None:
   if not tree.case_sensitive:
       relfile_list = [tree.get_canonical_path(f) for f in relfile_list]

I'm also trying to understand "tree.get_canonical_path".
"osutils.canonical_relpath" means to use a case-insensitive match 
against the OS's filesystem.

I think the problem is using the term "canonical" to mean 2 different 
things. Sometimes it means the path on disk, and sometimes it means to 
use the path in the tree. Put another way,
tree.get_canonical_path(path) != osutils.canonical_relpath(tree.basedir, 
path)

Which, IMO, means it isn't a *canonical* path.

I may be wrong on this. But if I do:

mv TODO Todo

python -c "from bzrlib import osutils, workingtree; wt = 
workingtree.WorkingTree.open('.'); print 
osutils.canonical_relpath(wt.basedir, 'todo'), 
wt.get_canonical_path('todo')"

Todo TODO

There is also a small issue that we have "tree.relpath()" which matches 
"osutils.relpath()", but you added "tree.get_canonical_path()" as the 
corallary to "osutils.canonical_relpath()".

For clarity, I would probably change it a bit. And have:

osutils.case_insensitive_relpath()
and
tree.case_insensitive_relpath()

And in: tree.case_insensitive_relpath(), we might do something like:
if self.case_sensitive:
    return self.relpath
...

I'm not 100% sure about that. If there are times where you may want to 
do a case-insensitive match on a case-sensitive tree. Either way, I 
would name them the same, and probably name them "XXX_relpath".


...


You changed 'bzr mv' to use get_canonical_relpath() for a lot of the 
arguments. But there was one place:
         into_existing = osutils.isdir(names_list[-1])
The very first access to the target directory. Before we pass it to 
"isdir()" shouldn't it have gone through osutils.canonical_relpath() ?

So someone can type "bzr mv path to WrongCaSe"
and still have it find "wrongcase" which is an existing file on disk?

Then again, I guess if the filesystem is case-insensitive, then isdir() 
will find it anyway.



I think it would be really good if we could bring in a 
CaseInsensitiveMemoryTree, to make it easier to test these things on all 
platforms. But I guess we don't abstract direct access to the filesystem 
enough to make that actually work. (TreeTransform, for example, goes 
direct to open() calls, etc.)


This code in "cmd_mv":
+            # Find the canonical version of the destination:  In all 
cases, the
+            # parent of the target must be in the inventory, so we 
fetch the
+            # canonical version from there.
+            dest_parent = 
tree.get_canonical_path(osutils.dirname(rel_names[1]))
+            dest_tail = osutils.basename(rel_names[1])
+            if after:
+                # If 'after' is specified, the tail must refer to a 
file on disk,
+                # so fixup the tail using the file-system.
+                if dest_parent:
+                    dest_parent_fq = osutils.pathjoin(tree.basedir, 
dest_parent)
+                else:
+                    # pathjoin with an empty tail adds a slash, which 
breaks
+                    # relpath :(
+                    dest_parent_fq = tree.basedir
+
+                dest_tail = osutils.canonical_relpath(
+                                dest_parent_fq,
+                                osutils.pathjoin(dest_parent_fq, 
dest_tail))
+            dest = osutils.pathjoin(dest_parent, dest_tail)

This seems like something that should be a separate-and-tested function 
on Tree. Actually, unless I misunderstand the code in 
get_canonical_relpath, I think this case is already handled.
Specifically, when the for loop is exhausted, you have:
             else:
                 # got to the end of this directory and no entries 
matched.
                 # Return what matched so far, plus the rest as 
specified.
                 cur_path = osutils.pathjoin(cur_path, elt, 
*list(bit_iter))
                 break

Though perhaps this doesn't work for the case of:

   mkdir Foo; touch file; Foo/BAR
   bzr add file Foo/BAR
   bzr commit
   bzr mv File foo/bar

In this case, your code would translate "File" => "file", and then it 
would translate "foo/bar" into "Foo/bar", rather than "Foo/BAR".
I could be wrong, but I have the feeling that the plain implementation 
of "get_canonical_relpath" is correct here. If the tip already exists 
(in the filesystem, or in the tree) then it should be used, instead of 
using the exact name supplied by the user, otherwise the code is already 
designed to return the path unchanged.

...

in cmd_remove:
          tree, file_list = tree_files(file_list)

          if file_list is not None:
-            file_list = [f for f in file_list]
+            file_list = [tree.get_canonical_path(f) for f in file_list]


I think this is correct in this case. But given what "tree_files()" 
really means, it feels to me that we likely should be doing it all the 
time.
So instead of updating every command to wrap the file_list with a 
"get_canonical_relpath()" call, we should instead change the 
implementation of "safe_relpath_files()"

So instead of doing:
     new_list.append(tree.relpath(osutils.dereference_path(filename)))
We do:
new_list.append(tree.get_canonical_path(osutils.dereference_path(filename)))

I guess the only reason not to do that is because you may be talking 
about the path-on-disk rather than the path-in-the-tree, which can 
certainly give different results, and why all of this is really hard to 
get right, because you never now for a string whether it is the path on 
disk, or the path in memory, etc.

...

=== modified file 'bzrlib/mutabletree.py'
--- bzrlib/mutabletree.py       2008-10-03 23:42:56 +0000
+++ bzrlib/mutabletree.py       2008-11-17 06:38:07 +0000
@@ -348,7 +348,7 @@
          # relative : it's cheaper to make a tree relative path an 
abspath
          # than to convert an abspath to tree relative.
          for filepath in file_list:
-            rf = _FastPath(self.relpath(filepath))
+            rf = _FastPath(self.canonical_relpath(filepath))
              # validate user parameters. Our recursive code avoids 
adding new files
              # that need such validation
              if self.is_control_filename(rf.raw_path):

^- At first, I was almost positive that this is wrong. I thought it 
either needed to be "osutils.canonical_relpath" or 
"self.get_canonical_path()" because there is no Tree.canonical_relpath() 
function that I could find.

However, it turns out that you implemented "canonical_relpath" on 
WorkingTree (though not on MutableTree or on Tree), which was a direct 
thunk to osutils.canonical_relpath(self.basedir, path)

Again, it stresses the need to be very clear about when you are asking 
for the case-insensitive path on disk, and when you are asking for the 
case-insensitive path in the tree (inventory). I originally was pushing 
to have the functions named the same, and calling the "Tree.XXX" would 
give you the path in the tree, and calling "osutils.XXX" would give you 
the path on disk. I see that you added a "Tree.XXX" and a "Tree.YYY" 
which gives you one or the other.

I see a few possibilities

a) Clearer and obviously separate function names, such as 
"Tree.filesystem_relpath()", or "Tree.cicp_filesystem_relpath()", versus 
"Tree.cicp_relpath()". Though I sort of want it to mean "give me the 
case-insensitive path if the fs is case-insensitive, otherwise give the 
case-sensitive path". And putting cicp seems a bit too strong that we 
would always give the case-insensitive form.

b) An argument to the function: "Tree.cicp_relpath(path, 
filesystem=True)"

c) Not using a function on Tree when you want the path from disk. (ie, 
always spelling out "osutils.canonical_relpath(tree.basedir, path)" when 
you want a path from disk)

...


+class _CaseInsCasePresFilenameFeature(Feature):
+    """Is the file-system case insensitive, but case-preserving?"""
+
+    def _probe(self):
+        fileno, name = tempfile.mkstemp(prefix='MixedCase')
+        try:
+            # first check truly case-preserving for created files, then 
check
+            # case insensitive when opening existing files.
+            name = osutils.normpath(name)
+            base, rel = osutils.split(name)
+            found_rel = osutils.canonical_relpath(base, name)
+            return (found_rel==rel
+                    and os.path.isfile(name.upper())
+                    and os.path.isfile(name.lower()))
+        finally:
+            os.close(fileno)
+            os.remove(name)
+
+CaseInsCasePresFilenameFeature = _CaseInsCasePresFilenameFeature()
+
+

^- I think you also want to define:
   def feature_name(self):
     return "case-insensitive case-preserving filesystem"

=== added file 'bzrlib/tests/blackbox/test_filesystem_cicp.py'
--- bzrlib/tests/blackbox/test_filesystem_cicp.py       1970-01-01 
00:00:00 +0000
+++ bzrlib/tests/blackbox/test_filesystem_cicp.py       2008-11-17 
12:15:58 +0000

...

+"""Tests variations of case-insensitive and case-preserving 
file-systems."""
+
+import os
+
+from bzrlib.tests.blackbox import ExternalBase
+from bzrlib.tests import CaseInsCasePresFilenameFeature, KnownFailure
+
+class TestIncorrectUserCase(ExternalBase):


^- Two blank lines between imports and a class definition.
This is really a time when it would be nice to have things abstracted 
such that we can use a CaseInsensitiveWorkingTree.... It probably isn't 
feasible in the short-term, though.



Also, seeing the number of times where you need to do something like:
+            file_list = [tree.get_canonical_path(f) for f in file_list]

I wonder if we shouldn't have a specific function for this. Something 
like:

file_list = tree.canonical_relpaths(file_list)

The idea is that could be a very good way to implement appropriate 
caching. The cache would exist for only a single call, but could be 
preserved until all entries had been handled.
It could also mean that multiple entries could be resolved 'in 
parallel', rather than determining that all of the files in the "Foo" 
directory were really in the "foo" directory.

I bring it up, because we may want to *only* provide the multi-version. 
Certainly you can always do:

path = tree.canonical_relpaths([path])[0]

And then we only need to provide 1 function, rather than 2.

Again, this isn't something that you have to do now. Just an idea about 
how we might want to structure it.


I didn't read the doc closely, though I'm happy to see documentation 
being added.



For details, see: 
http://bundlebuggy.aaronbentley.com/project/bzr/request/%3C4934FA08.5010105%40gmail.com%3E
Project: Bazaar