File names normalization

John Arbash Meinel john at arbash-meinel.com
Fri Sep 12 17:11:37 BST 2008


Vincent Ladeuil wrote:

...

>     john> If we don't force normalization on Win32 and Linux, but
>     john> *do* enforce it on Mac (because we know Mac "renames"
>     john> stuff), then *sometimes* people will commit
>     john> non-normalized filenames on Windows, and then when they
>     john> check it out on Mac it will appear renamed.
> 
> Indeed. I can't see how to address that since there is an
> inherent contradiction.
> 
> Either we have a canonical representation (NFC) and we may need
> to rename files (but people should not be able to visually
> notice[1]) or we don't. In the latter case, we can't guarantee
> that the files can be checked out on Mac.
> 
> I view the latter case as somewhat similar to the case-sensitivity
> issues because it allows two different files with the same NFC
> form to coexist.

I agree that it is quite similar.
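
To make the analogy concrete, here is a minimal sketch of two names that
differ code point by code point yet share one NFC form. The file name and
the helper `same_nfc` are made up for illustration; only the standard
`unicodedata` module is assumed.

```python
import unicodedata

# The same visible name, spelled two ways (hypothetical example file).
nfc_name = "caf\u00e9.txt"    # 'é' as a single precomposed code point
nfd_name = "cafe\u0301.txt"   # 'e' followed by a combining acute accent


def same_nfc(a, b):
    """Return True when two names collapse to the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Different strings as far as a non-normalizing filesystem is concerned...
print(nfc_name == nfd_name)
# ...but one and the same name once both are normalized.
print(same_nfc(nfc_name, nfd_name))
```

On a filesystem that does not normalize, both names can exist side by side
in one directory, much like `README` and `readme` on a case-sensitive one.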

> 
>     john> Compare that to never normalizing, where people check
>     john> in NFC names on Linux, and then we re-normalize for
>     john> Mac. Probably we would get *more* cases correct, but we
>     john> still lose edge cases.
> 
> Yes, that's why I would prefer to always normalize (whatever
> canonical form we choose).

Actually, my point here was to *not* normalize on any platform *except*
Mac. So when someone checks in an NFD path on Linux, it will break on
Mac. But as the common case is checking in NFC on Linux and having it
silently translated to NFD on Mac, we at least get the common case right.

...

>     john> And hasn't been proven to be a win for the "majority"
>     john> of users. (Why force NFC encoding on Windows/Linux for
>     john> people who may never check the code out on Mac? etc)
> 
> Can you remind me why it was a loss (except for [1]) ?

You didn't give a [1].

Fundamentally, everyone on a platform that doesn't normalize is being
penalized because a (minority) platform that they probably don't even
use does normalize. The tradeoff doesn't seem great.

We had people on Japanese Windows unable to add their files because the
filenames were not normalized. (The system preferred wide characters over
the 'classic' ones.)

We *could* ask them to rename their files to fit convention, but there
is no guarantee that it would be (a) easy, or (b) stable. If Office
wants to write wide-characters, then it probably is going to continue to
do so.


In the end, it is a *lot* easier to not normalize at all. People who are
on a Mac and commit files will have all their files in NFD. When that is
checked out on Linux/Win32 it will stay in NFD. (Those will look funny
in Explorer, but otherwise should be fine.)
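
As a rough model of that behaviour, the sketch below treats the Mac
filesystem as "always converts to NFD" and every other platform as "keeps
the name it was given". That is an assumption for illustration (plain
Unicode NFD stands in for whatever the Mac actually does), and the function
names and platform labels are hypothetical, not anything in bzr.

```python
import unicodedata


def name_on_disk(stored_name, platform):
    """Model the name a checkout ends up with on the given platform.

    Assumption: 'mac' forces NFD; every other platform leaves names alone.
    """
    if platform == "mac":
        return unicodedata.normalize("NFD", stored_name)
    return stored_name


def appears_renamed(stored_name, platform):
    """True when the on-disk name differs from the name that was committed."""
    return name_on_disk(stored_name, platform) != stored_name


nfc_name = "caf\u00e9.txt"    # typically what gets committed on Linux/Win32
nfd_name = "cafe\u0301.txt"   # what a Mac would commit

for name in (nfc_name, nfd_name):
    for platform in ("linux", "win32", "mac"):
        print(ascii(name), platform, appears_renamed(name, platform))
```

Under this model the only combination that reports a rename is a non-NFD
name checked out on the Mac.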

If someone checks in an NFC path on Linux and then checks it out on another
Linux or Windows machine, everything stays fine. If they check in an NFD
or *mixed*-normalization path, everything works fine too. The only case that
breaks is the person who checks out a project not using NFD onto a Mac.
So the only person penalized is the person using the system that chose
to be different. We can provide some support *on that system* by:

1) Automatically detecting the renames, and recording them as such.
2) Detecting OS renaming and hacking in a compatibility layer. Akin to
   the "line ending" debate. Only in this case we have a layer which
   translates filenames instead of contents. It can be installed only when
   on the Mac platform, which means all the other platforms don't have to
   have the workarounds.
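
A very rough sketch of what option 2 could look like: map the names the
disk reports back to the names stored in the tree by comparing their NFD
forms. Everything here is hypothetical (`build_name_map`,
`translate_disk_names`, and the idea of passing in plain lists of names);
it is not how bzr is structured, and plain NFD is again only an
approximation of what the Mac does.

```python
import unicodedata


def _key(name):
    # Assumption: NFD is close enough to the form the Mac reports on disk.
    return unicodedata.normalize("NFD", name)


def build_name_map(stored_names):
    """Map NFD keys to stored names, and collect names that would collide."""
    mapping = {}
    collisions = []
    for name in stored_names:
        key = _key(name)
        if key in mapping and mapping[key] != name:
            # Two stored names that the Mac cannot tell apart.
            collisions.append((mapping[key], name))
        else:
            mapping[key] = name
    return mapping, collisions


def translate_disk_names(disk_names, stored_names):
    """Replace each on-disk name with its stored spelling when one is known."""
    mapping, _ = build_name_map(stored_names)
    return [mapping.get(_key(name), name) for name in disk_names]


stored = ["caf\u00e9.txt", "README"]               # as committed (NFC here)
on_disk = ["cafe\u0301.txt", "README", "new.txt"]  # as a Mac would list them
print([ascii(n) for n in translate_disk_names(on_disk, stored)])
```

Names with no stored counterpart (`new.txt` above) pass through untouched,
and the collision list is where the "two files with the same normalized
form" case would have to be reported to the user.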


John
=:->



