Strip incompatible characters from Windows partitions!

Fri May 16 18:43:02 UTC 2008

Andrew Sayers wrote:
> Because there are no proper standards for Windows filesystems, there's
> no common agreement about how to turn the string of bytes that make up a
> FAT filename into a string of characters.  For example, a Japanese
> computer might look at a filesystem and assume that all the files are
> encoded in SHIFT-JIS, while a Western European computer might look at
> the same filesystem and assume that all the files are encoded in code
> page 1252.

    There actually is a well-defined standard for FAT filesystems.
Further, I believe that FAT16 (and earlier) are now open, and there
are a number of implementations available under various licenses, so
it's not necessarily just a Windows issue.  For reference, codepage
932 is used for Japanese, and differs slightly from Shift-JIS in that
it includes codepoints for some extra NEC characters.  Further,
codepage 437 is likely at least as common as codepage 1252 for
interoperability concerns.
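
    As a rough illustration of the codepage 932 versus Shift-JIS
difference, the sketch below decodes one byte pair from the NEC
extension area with two of Python's built-in codecs.  The choice of
the byte pair 0x87 0x40 and the use of Python's "shift_jis" codec as a
stand-in for plain Shift-JIS are my assumptions, not something taken
from the FAT documentation.

```python
# Byte pair 0x87 0x40 sits in the NEC extension row that codepage 932
# adds on top of plain Shift-JIS (assumed example value).
nec_bytes = b"\x87\x40"

# Codepage 932 maps it to CIRCLED DIGIT ONE (U+2460).
as_cp932 = nec_bytes.decode("cp932")

# Python's stricter "shift_jis" codec has no mapping for this pair.
try:
    nec_bytes.decode("shift_jis")
    strict_sjis_ok = True
except UnicodeDecodeError:
    strict_sjis_ok = False

print(repr(as_cp932), strict_sjis_ok)
```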

    The specific issue is that the FAT filesystem itself doesn't
encode any metadata specifying what codepage is used for the files,
and more interestingly, does not consider it an error if files are
stored with names in more than one codepage (although this may cause
issues for programs reading from the filesystem, depending on their
configuration).
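
    To make that concrete, here is a minimal sketch that takes one
made-up sequence of name bytes and decodes it under three codepages.
The byte string and the variable names are hypothetical; the point is
only that nothing in the bytes themselves says which reading is the
intended one.

```python
# Hypothetical name bytes as they might sit in a directory entry.
raw_name = b"\x83\x65\x83\x58\x83\x67"

# The same bytes, read under three different codepages.
decoded = {cp: raw_name.decode(cp) for cp in ("cp932", "cp1252", "cp437")}

for codepage, text in decoded.items():
    print(codepage, text)
```

    Under codepage 932 the six bytes form three katakana characters;
under 1252 and 437 they form six unrelated Latin characters.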

> Most irritatingly, FAT filenames can use single-byte encodings (like
> ASCII), multi-byte (like UTF-8), or double-byte (like UTF-16).  This
> means that a filename might be valid ASCII (perhaps including some
> "disallowed" ASCII characters, perhaps not), but which would be garbled
> nonsense if interpreted as such.

    More precisely, FAT allows for two types of character
representation, whose definition depends on the codepage in use at
the time a file is read or written.  Any given character may be
encoded in either SBCS (8-bit) or DBCS (16-bit) format, and the glyph
associated with a codepoint is determined by the codepage in use at
the time of the read or write.  These formats are substantially
different from UTF-8 or UCS-2, although they generally share some
similarities with each (again, depending on the codepage).  Note that
these representations are only used for file and directory names; the
file contents are not necessarily so encoded.
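
    A small sketch of how a reader might walk such a name: each byte
is either a complete SBCS character or the lead byte of a DBCS pair,
and which one it is depends on the codepage.  The lead-byte ranges
below are the ones commonly given for codepage 932 (0x81-0x9F and
0xE0-0xFC); the function names are hypothetical, and other DBCS
codepages would need different ranges.

```python
def is_cp932_lead_byte(byte: int) -> bool:
    # Assumed lead-byte ranges for codepage 932.
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_characters(raw: bytes) -> list:
    """Split name bytes into 1-byte (SBCS) and 2-byte (DBCS) units."""
    units = []
    i = 0
    while i < len(raw):
        # A lead byte consumes the following byte too, if there is one.
        if is_cp932_lead_byte(raw[i]) and i + 1 < len(raw):
            width = 2
        else:
            width = 1
        units.append(raw[i:i + width])
        i += width
    return units


print(split_characters(b"\x83\x65A.TXT"))
```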

> Disallowed characters aren't so much a Windows kernel issue as a
> pervasive Windows UI issue.  The exception that proves the rule is Emacs
> on Windows.  Emacs being Emacs, it pays little attention to the
> conventions of young upstarts like Microsoft, so can handle files with
> funnily-named characters just fine.

    Emacs isn't a unique exception.  Anything capable of editing the
directory entry directly (vim, your favorite hex editor, Midnight
Commander, etc.) can rename the files.  The issue is more that the
standard file access libraries will have difficulties (regardless of
operating system), as these make assumptions about the meaning of
characters provided as arguments (e.g. '\'), rather than treating the
name as a raw bytestream.  Further, while Emacs can rename the files
in dired mode, I don't believe it can open the files until they have
been renamed (although I don't have the necessary software to
substantiate this belief).
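
    One way such an assumption can bite, sketched under my own choice
of example: in codepage 932 the second byte of some DBCS characters is
0x5C, the same value as '\'.  Code that splits a path on that byte
without knowing the codepage cuts the character in half.  The file
name and variable names here are hypothetical.

```python
# Hypothetical name whose first character encodes as 0x83 0x5C in cp932.
raw_name = "\u30bd\u30d5\u30c8.txt".encode("cp932")

# Naive approach: treat every 0x5C byte as a path separator.
naive_parts = raw_name.split(b"\\")

# Codepage-aware approach: decode first, then split on the character.
aware_parts = raw_name.decode("cp932").split("\\")

print(naive_parts)
print(aware_parts)
```

    The naive split reports two path components for what is a single
file name, while the decoded string contains no backslash at all.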

-- 
Emmet HIKORY