[MERGE/RFC] case sensitivity on Windows

Sun Dec 21 10:32:32 GMT 2008

On 20/12/2008 5:09 AM, John Arbash Meinel wrote:

 > Which just as a start makes me wonder about the layering. Certainly
 > we can move case_sensitive up into MutableTree, and just have it
 > default to True (since Memory implementations would be case
 > sensitive, etc.)

Done.

 > So I'm thinking we should probably just promote
 > 'cicp_canonical_relpath' to be plain 'canonical_relpath'
 > and just try to be careful about when we use it.

I was a little concerned that in the few places we *do* need to use it, 
the hit for non-Windows operating systems would be of concern.  I 
figured it was better to only enable that functionality on Windows 
(canonical_relpath is aliased to relpath in other cases) until we have 
better experience with it.

I've left it named '_cicp_canonical_relpath()' and left the sys.platform 
check before it is enabled - let me know if you still want that changed.

 > So in discussing canonical_relpath, why do you need to call abspath()
 > before you start iterating it?

Primarily because it is easier and is "more obviously" similar to the 
normal relpath implementation.  This is an implementation detail we 
could tweak later, right?

> I'm also trying to understand "tree.get_canonical_path".
> "osutils.canonical_relpath" means to use a case-insensitive match
> against the OS's filesystem.
>
> I think the problem is using the term "canonical" to mean 2 different
> things. Sometimes it means the path on disk, and sometimes it means to
> use the path in the tree. Put another way,
> tree.get_canonical_path(path) != osutils.canonical_relpath(tree.basedir,
> path)
>
> Which, IMO, means it isn't a *canonical* path.

The *real* problem is that we have 2 different stores, each of which has 
a canonical version of a name.  Indeed, a key part of this patch is 
handling the fact there are 2 different 'canonical' versions of a file 
which may differ by case.

> I may be wrong on this. But if I do:
>
> mv TODO Todo
>
> python -c "from bzrlib import osutils, workingtree; wt =
> workingtree.WorkingTree.open('.'); print
> osutils.canonical_relpath(wt.basedir, 'todo'),
> wt.get_canonical_path('todo')"
>
> Todo TODO

Exactly.  This simple fact demonstrates the need for the special 
handling this patch introduces.

> There is also a small issue that we have "tree.relpath()" which matches
> "osutils.relpath()", but you added "tree.get_canonical_path()" as the
> corallary to "osutils.canonical_relpath()".
>
> For clarity, I would probably change it a bit. And have:
>
> osutils.case_insensitive_relpath()
> and
> tree.case_insensitive_relpath()

I'll happily use whatever name you agree, but at first glance, 
'case_insensitive_relpath()' doesn't give the casual reader a good clue 
about what is returned.  For example, it wouldn't surprise me to find a 
function with that name simply returned the .lower() version of the 
filename.

I like the word 'canonical' or similar as it implies it is matching the 
name against something and returning the internally stored variant of 
the name.  As you noted above, the complication is the 2 sources of 
canonical names :(

So: taking into account your other comments, here is what I've come up with:

* There is no WorkingTree.canonical_relpath() function - people who need 
that functionality call osutils directly.

* osutils sticks with the name 'canonical_relpath()' and adds 
'canonical_relpaths()' - hopefully the fact they live in osutils implies 
we are talking about the file-system.

* Tree has methods 'get_canonical_inventory_paths' and 
'get_canonical_inventory_paths()'.  The names are long, but I hope they 
are now unambiguous.

I think that should address most concerns, but please let me know if you 
would like to decree other specific names I should use - I'm really not 
attached to the names :)

> You changed 'bzr mv' to use get_canonical_relpath() for a lot of the
> arguments. But there was one place:
> into_existing = osutils.isdir(names_list[-1])
> The very first access to the target directory. Before we pass it to
> "isdir()" shouldn't it have gone through osutils.canonical_relpath() ?
>
> So someone can type "bzr mv path to WrongCaSe"
> and still have it find "wrongcase" which is an existing file on disk?
>
> Then again, I guess if the filesystem is case-insensitive, then isdir()
> will find it anyway.

That's correct and covered by the tests.  So it's OK to leave that?

> This code in "cmd_mv":
> + # Find the canonical version of the destination: In all cases, the
> + # parent of the target must be in the inventory, so we fetch the
> + # canonical version from there.
> + dest_parent = tree.get_canonical_path(osutils.dirname(rel_names[1]))
> + dest_tail = osutils.basename(rel_names[1])
> + if after:
> + # If 'after' is specified, the tail must refer to a file on disk,
> + # so fixup the tail using the file-system.
> + if dest_parent:
> + dest_parent_fq = osutils.pathjoin(tree.basedir, dest_parent)
> + else:
> + # pathjoin with an empty tail adds a slash, which breaks
> + # relpath :(
> + dest_parent_fq = tree.basedir
> +
> + dest_tail = osutils.canonical_relpath(
> + dest_parent_fq,
> + osutils.pathjoin(dest_parent_fq, dest_tail))
> + dest = osutils.pathjoin(dest_parent, dest_tail)
>
> This seems like something that should be a separate-and-tested function
> on Tree.

Possibly - although that appears to be the only code in the tree that 
needs the functionality, and it really is specialized to 'mv'; it is 
only used when we have already checked the destination isn't an existing 
directory and has special casing for the --after arg.  So I'm not sure I 
see much to be gained by splitting it out and testability can be handled 
just fine by blackbox tests?

If you think it still should be refactored, in the efforts of reducing 
the round-trips please let me know where it should be moved to and how 
it should be named.

> Actually, unless I misunderstand the code in
> get_canonical_relpath, I think this case is already handled.
> Specifically, when the for loop is exhausted, you have:

The case being handled here is an explicit rename to a new case.  If we 
normalize the tail portion of the name, we lose the case that the user 
specified - ie, it would then be impossible to do a case-only rename.

> else:
> # got to the end of this directory and no entries matched.
> # Return what matched so far, plus the rest as specified.
> cur_path = osutils.pathjoin(cur_path, elt, *list(bit_iter))
> break
>
> Though perhaps this doesn't work for the case of:
>
> mkdir Foo; touch file; Foo/BAR
> bzr add file Foo/BAR
> bzr commit
> bzr mv File foo/bar

Actually, if you try that now you get an error about "bar" already 
existing and that --after should be used.  I did manage to find some 
other 'mv' related edge cases which I've captured in blackbox tests, 
which required rework of this part of the patch.

> If the tip already exists (in the filesystem, or in the tree) then it
> should be used, instead of

Yeah - I think I've now got most of the 'mv' use-cases spelled out in 
the tests; a complication I didn't handle before was the need to prevent 
creating new inventory items when an existing item that differs by case 
exist.

> ...
>
> in cmd_remove:
> tree, file_list = tree_files(file_list)
>
> if file_list is not None:
> - file_list = [f for f in file_list]
> + file_list = [tree.get_canonical_path(f) for f in file_list]
>
>
> I think this is correct in this case. But given what "tree_files()"
> really means, it feels to me that we likely should be doing it all the
> time.
> So instead of updating every command to wrap the file_list with a
> "get_canonical_relpath()" call, we should instead change the
> implementation of "safe_relpath_files()"

My first cut at the patch did exactly that, but it turned out to not be 
appropriate, as you note:

> I guess the only reason not to do that is because you may be talking
> about the path-on-disk rather than the path-in-the-tree, which can
> certainly give different results, and why all of this is really hard to
> get right, because you never now for a string whether it is the path on
> disk, or the path in memory, etc.

Exactly - each consumer of tree_files() needs to be examined and the 
appropriate action for that command taken, possibly depending on other 
params, as we saw above.  As you note, in tree_files() you aren't sure 
which "canonical" version is desired, if any, but each consumer of 
tree_files() should only have one unambiguous requirement in this regard 
- hence the burden really does need to fall on each individual command. 
  I did try passing a param down through tree_files() for this purpose, 
but it got ugly quickly...

...
> Also, seeing the number of times where you need to do something like:
> + file_list = [tree.get_canonical_path(f) for f in file_list]
>
> I wonder if we shouldn't have a specific function for this. Something like:
>
> file_list = tree.canonical_relpaths(file_list)

Yeah - added tree.canonical_inventory_relpaths(), and the equivalent 
osutils.canonical_relpaths(), although I've left the caching to later.

>
> The idea is that could be a very good way to implement appropriate
> caching. The cache would exist for only a single call, but could be
> preserved until all entries had been handled.
> It could also mean that multiple entries could be resolved 'in
> parallel', rather than determining that all of the files in the "Foo"
> directory were really in the "foo" directory.
>
> I bring it up, because we may want to *only* provide the multi-version.

Good point - I experimented with removing the singular version, but 
unfortunately that did cost some readability and loss of clarity - eg, I 
ended up with:

|tree.get_canonical_inventory_paths([osutils.dirname(rel_names[1])])[0]

which is quite a bit more complicated than:

|tree.get_canonical_inventory_path(osutils.dirname(rel_names[1]))

Even if the first form is split over multiple lines I wasn't 
particularly happy with the result.

The updated patch includes singular and multiple versions of both 
functions, with the implementation of caching in the multiple versions 
left as a future optimization.

Cheers,

Mark
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cicp_filesystem.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20081221/b5bb3953/attachment-0001.diff