File names normalization

Vincent Ladeuil v.ladeuil+lp at free.fr
Fri Sep 12 07:18:56 BST 2008


>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

    john> ...

    >> So, I'll summarize as follows:
    >> 
    >> - internally, bzr represents file names as NFC normalized, utf-8
    >> encoded strings,

    john>   - at one point, bzr internally represented file names as NFC
    john>     normalized, Unicode strings.

Too bad.

    john> I've somewhat given up on that because of people who
    john> want mixed strings, which conflicts with people who
    john> want automatic normalization between platforms.

If people wanted un-normalized strings... well, there is little
we can do...

    john> If we don't force normalization on Win32 and Linux, but
    john> *do* enforce it on Mac (because we know Mac "renames"
    john> stuff), then *sometimes* people will commit
    john> non-normalized filenames on Windows, and then when they
    john> check it out on Mac it will appear renamed.

Indeed. I can't see how to address that since there is an
inherent contradiction.

Either we have a canonical representation (NFC) and we may need
to rename files (but people should not be able to visually
notice[1]) or we don't. In the latter case, we can't guarantee
that the files can be checked out on Mac.

I view the latter case as somewhat similar to the case-sensitivity
issues because it allows two different files with the same NFC
form to coexist.
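
To make the clash concrete, here is a minimal sketch in Python 3
using only the standard unicodedata module. The file name and the
helper same_canonical_name are made up for illustration; this is
not bzr code.

```python
import unicodedata

# Two spellings of the same visible name "é.txt".
nfc_name = "\u00e9.txt"    # precomposed: U+00E9
nfd_name = "e\u0301.txt"   # decomposed: 'e' followed by U+0301 (combining acute)


def same_canonical_name(a, b):
    """Return True if both names share the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ, so a tool that never normalizes sees two files.
print(nfc_name == nfd_name)
# Their NFC forms are identical, so they are one file after normalization.
print(same_canonical_name(nfc_name, nfd_name))
```

Without normalization both names can be versioned side by side,
yet a file system that normalizes names, as the Mac one does,
cannot hold both in the same directory.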

    john> Compare that to never normalizing, where people check
    john> in NFC names on Linux, and then we re-normalize for
    john> Mac. Probably we would get *more* cases correct, but we
    john> still lose edge cases.

Yes, that's why I would prefer to always normalize (whatever
canonical form we choose).

    >> 
    >> - it checks, during add and rename, that file names are in a
    >> canonical form,
    >> 
    >> - using file systems with a non-NFC encoding requires transcoding
    >> at various points.
    >> 
    >> The last point being the one that needs an unknown amount of work
    >> to be 100% correct.
    >> 
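
The last two points could be sketched roughly as follows in
Python 3. The names is_canonical, to_filesystem and
from_filesystem are hypothetical, not bzr APIs, and the Mac file
system form is approximated by plain NFD, which is an assumption
(the real form is a variant of it).

```python
import unicodedata


def is_canonical(name):
    """Hypothetical add/rename check: is the name already in NFC?"""
    return unicodedata.normalize("NFC", name) == name


def to_filesystem(name, fs_form="NFD"):
    """Transcode an internal NFC name to the form the file system stores.

    fs_form="NFD" is an assumption standing in for a decomposing file system.
    """
    return unicodedata.normalize(fs_form, name)


def from_filesystem(name):
    """Transcode a name read from disk back to the internal NFC form."""
    return unicodedata.normalize("NFC", name)


internal = "\u00e9.txt"              # NFC, accepted by the check
on_disk = to_filesystem(internal)    # decomposed form written to disk
print(is_canonical(internal), is_canonical(on_disk))
print(from_filesystem(on_disk) == internal)
```

Every place that reads names from disk or writes them back would
need one of the two transcoding calls, which is where the unknown
amount of work comes from.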

    john> And hasn't been proven to be a win for the "majority"
    john> of users. (Why force NFC encoding on Windows/Linux for
    john> people who may never check the code out on Mac? etc)

Can you remind me why it was a loss (except for [1])?

    john> My current idea was to just reduce the internal
    john> checking, let people version any Unicode name they
    john> want. And let people on Mac suffer the consequences of
    john> a filesystem that doesn't play by the same rules
    john> everyone else does.

Hmm, it seems you went to the dark side! :D

    john> I just didn't get far enough to rip out all of the
    john> normalization checks.

Don't :)

    john> We should do a proper fix and audit the code. It just
    john> hasn't been a high priority. Also, I haven't found
    john> anyone else particularly interested in the topic.

We first need to clearly define the rules.

    john> And *I* don't version non-ascii filenames at this time,
    john> nor do I have a habit of checking them out on other
    john> platforms. (Mac used to be my primary laptop, but is
    john> not anymore.)

Indeed, you went to the dark side :)

        Vincent


