[RFC] case sensitivity on Windows

Fri Oct 31 05:35:39 GMT 2008

On Thu, Oct 30, 2008 at 11:01 AM, Mark Hammond
<mhammond at skippinet.com.au> wrote:
> Sorry for the delay - I've been playing with this and making some progress, and it gets tricky fairly quickly :)  I thought I'd share my progress and ask a couple of questions:

That's still pretty nice progress.  I'll read the patch now, but as a
tip, you should really update the subject line to include [merge] so
that BB will catch it.  It's a good safety net to make sure it does
get an answer one way or the other.

> It addresses all the use-cases in http://bazaar-vcs.org/CasePreservingWorkingTreeUseCases (and adds blackbox tests for them) - except the last - handling renames...

That looks like a good list.  I wonder if there will be some other
cases that need to be handled in eg handling merges or conflicts.

If we merge this in I'd like to see that wiki page go into
doc/developer to reflect it's more formal status, and also a paragraph
or so pointing to how it's done.

+def _win32_canonical_relpath(base, path):
+    """Return the canonical path relative to base.
+
+    Like relpath, but on case-insensitive-case-preserving file-systems, this
+    will return the relpath as stored on the file-system rather than
in the case
+    specified in the input string, for all existing portions of the path.
+
+    TODO: it should be possible to optimize this by using the win32 API
+    FindFiles function to look for the specified name - but using os.listdir()
+    still gives us the correct semantics in the short term.
+    """
+    rel = relpath(base, path)
+    # '.' will have been turned into ''
+    if not rel:
+        return rel
+
+    abs_base = abspath(base)
+    current = abs_base
+    _listdir = os.listdir
+
+    # use an explicit iterator so we can easily consume the rest on early exit.
+    bit_iter = iter(rel.replace('\\', '/').split('/'))
+    for bit in bit_iter:
+        lbit = bit.lower()
+        for look in _listdir(current):
+            if lbit == look.lower():
+                current = pathjoin(current, look)
+                break
+        else:
+            # got to the end, nothing matched, so we just return the
+            # non-existing bits as they were specified (the filename may be
+            # the target of a move, for example).
+            current = pathjoin(current, bit, *list(bit_iter))
+            break
+    return current[len(abs_base)+1:]
+
+if sys.platform == "win32":
+    canonical_relpath = _win32_canonical_relpath
+else:
+    canonical_relpath = relpath
+
+

So this routine, as it currently stands, is not win32-specific (aside
from the comment), but in fact is the kind of thing we'd like to also
use on os x or linux case-insensitive filesystems.  So even if we're
not handling those cases at present, I'd name it accordingly,
cicp_canonical_relpath or something.  That also gets away a bit from
assuming the canonical relpath algorithm is determined only by the
platform, when in fact it seems like it may vary per tree (eg in the
vfat on linux case).

In that context folding \ to / here is questionable, if we want to
support such names on platforms that can actually handle them, but it
could probably be fixed when that bug is actually fixed.

There is some risk that doing listdir() in here for every path will
cause O(n) behaviour if we call this on every path in a tree.  I'd
probably just add a comment for now; it may not be significant.  An
easy optimization may be to check whether the name is in the list
before folding case.

It looks like this will only be run on paths coming from the user at
this time, and that's probably not many (especially on windows, with
less use of things like xargs).

So I'd be happy to merge this in as at least improving support for some cases.

>> So for something like "add" I would like a WorkingTree api for
>> canonicalizing the path that we are given. 'smart_add' has a nice
>> location for doing this when we:
>> # validate user file paths and convert all paths to tree ...
>
> I've come up with a couple of new functions - osutils.canonical_relpath() and WorkingTree.canonical_relpath().  It works exactly the same as relpath, but any parts of the returned relative string which already exist are normalized to how they exist on the file-system.  As the attached patch mentions in an XXX comment, it could be optimized by using win32api.FindFiles, but its probably better to have unoptimized-but-correct functionality than undesirable functionality (and specifically, it wasn't clear how to hook an 'optional' win32api function via win32utils into this purpose and I didn't want to get bogged down by details.)  As John mentioned, smart_add has a nice place for this, and for many other functions safe_relpath_files() in builtins.py is convenient (but not entirely correct)
>

>> As for the rest of the time...
>>
>> 'iter_changes' would need a way to track which files 'missed' on the
>> first pass, and then come back and correct it. We have a slight
>> advantage that we generally work in a directory-at-a-time, though there
>> are still problems. I believe the output order is defined as
>> sorted-per-directory, so you can't yield anything once you have a miss,
>> until you get to the end and resolve whether it was a *real* miss or
>> not.
>
> This is to handle the last 2 paragraphs in http://bazaar-vcs.org/CasePreservingWorkingTreeUseCases?  IOW, handling foo 'unintentionally' being renamed to Foo (ie, your example below, but assuming the mv was 'accidental' - eg due to the user specifying an 'incorrect' case to some other tool)?
>
>> touch foo
>> bzr add foo
>> mv foo Foo
>> bzr add Foo
>>
>> doesn't try to add Foo twice even though it has a different name now.
>
> [I'm actually surprised workingtree's 'case_sensitive' attribute doesn't prevent this already - it may be that attribute is only used during merges - but that's a bit of a digression for now]
>
> Right.  This is quite possible in the 'accidental' rename case and 'bzr add' wasn't given any args.  I'm starting to run into issues where I'm not clear of the desired semantics in this situation and it would help greatly to clarify them.
>
> Off the top my my head:
>
> (the above)
> % touch foo
> % bzr add foo
> % mv foo Foo
> % bzr add Foo
> => '' (ie, bzr should ignore the filesystem case for an existing entry)
>
> (the above, but the last 'add' is implicit)
> % touch foo
> % bzr add foo
> % mv foo Foo
> % bzr add
> => '' (ie, bzr should ignore the filesystem case for an existing entry)
> % bzr diff/status
> => '' (ie, nothing has changed)

This one is a bit interesting because it's saying that if the file was
renamed externally, we don't consider that it was "semantically"
renamed.  On Linux it would show as the newly added file being removed
and a new one present, until you confirmed that it had moved.

>
> Similarly if the *contents* of 'foo' changes along with the case:
> % touch foo
> % bzr add foo
> % mv foo Foo
> % echo hello > Foo
> % bzr diff/status
> => 'reports that "foo" has changed'
>
> But the user can explicitly rename either via bzr:
> % touch foo
> % bzr add foo
> % bzr mv foo Foo
> => 'renamed foo -> Foo' - and as usual, bzr will have done a rename on the file-system.
>
> Or manually before telling bzr:
> % touch foo
> % bzr add foo
> % mv foo Foo
> % bzr mv foo Foo
> => 'renamed foo -> Foo' (bzr doesn't need to touch the file-system)

Also

  touch foo
  bzr mkdir goo
  bzr add
  bzr mv FOO GOO
  => renamed ./foo ->./goo/foo
 (or arguably ./foo -> ./goo/FOO)

and also

  touch foo goo
  bzr add
  bzr mv Foo Goo
  => error

bzr mv has two almost separate ways of interpreting the command line

  bzr mv OLDNAME NEWNAME
  bzr mv OLDNAME... NEWDIR

and you've mostly been talking about the former.  In the latter, I
suppose we want to match case-insensitive against all the parameters,
which is not so hard.

> But to be consistent with the other scenarios where the file-system case doesn't match the command-line:
>
> % touch foo
> % bzr add foo
> % mv foo FOO
> % bzr mv foo Foo
> => 'renamed foo -> FOO' (ie, bzr used the FOO on the file-system in preference to Foo on the cmdline).
> Indeed, in the above example 'bzr mv foo foo' would also be interpreted as 'mv foo FOO'
>
> But we could (its not clear if we *should*) insist the case for the *source* of the move exactly matches the inventory:
> % touch foo
> % bzr add foo
> % mv foo FOO
> % bzr mv FOO Foo
> => 'ERROR: no entry FOO' (this possibly works already)

So this seems to be driving towards, in the two-argument (renaming
rather than moving) case of bzr mv:

The first argument, obviously enough, tries to match loosely against
an existing file.

The second argument then tries to match against an existing entry.

If it does match, and it matches the same thing as the first argument,
then we change only the case of the file, both in memory and on disk.
(Except, as a special case, we don't rename a file to a name it
already has.)

Otherwise, if it matches an existing directory, it's actually a move
into that directory.

Otherwise, if it matches any other versioned object, it's an error.
(file exists)

Otherwise, we rename both the inventory entry and the file on disk.

I don't think changing the case of existing files is the most
important use case though; but it would be good to handle it sensibly.

> So I guess I'm asking for 3 things (all of which can be done independently :)
>
> * Comments and general review of the patch as it stands and with the limitations described above.
>
> * Agreement/modifications to the scenarios above.
>
> * Suggestions for moving forward.  Publish the branch at launchpad and hope to get some collaborators?

Keep putting up little patches like this and/or specific questions,
we'll get them merged in, and seek feedback or bug reports from
others.

-- 
Martin <http://launchpad.net/~mbp/>