Strip incompatible characters from Windows partitions!

Andrew Sayers andrew-ubuntu-devel at pileofstuff.org
Fri May 16 16:59:42 UTC 2008


This e-mail summarises a discussion in #ubuntu-motu between myself,
ScottK and persia.  I'll first explain the general problem, then suggest
a messy solution to a surprisingly messy problem.  Most of these ideas
are not my own, and in fact had to be explained to me at some length, so
please don't assume that I know what I'm talking about ;)

Since there wasn't an NTFS expert available during the conversation,
it's possible that the following is only true of FAT filesystems.

Characters like '&' and '/' are in fact just the tip of the iceberg -
see https://bugs.launchpad.net/ubuntu/+source/dosfstools/+bug/49217 for
another way that the same problem can bite you.

Because there are no proper standards for Windows filesystems, there's
no common agreement about how to turn the string of bytes that make up a
FAT filename into a string of characters.  For example, a Japanese
computer might look at a filesystem and assume that all the files are
encoded in SHIFT-JIS, while a Western European computer might look at
the same filesystem and assume that all the files are encoded in code
page 1252.
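
To make that concrete, here is a small Python sketch (not taken from the
IRC discussion; the sample name is made up) that decodes one string of
bytes the way each of those two computers might:

```python
# The same six bytes, read under two different assumptions.
# The sample name is hypothetical; it is just "test" written in katakana.
raw = "テスト".encode("shift_jis")

as_sjis = raw.decode("shift_jis")                    # what the Japanese machine sees
as_cp1252 = raw.decode("cp1252", errors="replace")   # what the Western European machine sees

print(repr(as_sjis))    # 'テスト'
print(repr(as_cp1252))  # 'ƒeƒXƒg'
```

Nothing in the bytes themselves says which reading is the right one.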

Most irritatingly, FAT filenames can use single-byte encodings (like
ASCII), multi-byte (like UTF-8), or double-byte (like UTF-16).  This
means that the bytes of a filename might be valid ASCII (perhaps
including some "disallowed" ASCII characters, perhaps not), yet turn
into garbled nonsense if interpreted as such.
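
As a contrived illustration in Python (the two bytes are picked by hand
to make the point, and I'm not claiming any real FAT driver would hand
you exactly these), here is a pair of bytes that is perfectly valid ASCII
and also a perfectly valid UTF-16 character:

```python
# Two bytes chosen by hand for illustration: 0x26 0x2F.
raw = bytes([0x26, 0x2F])

as_ascii = raw.decode("ascii")      # two "awkward" ASCII characters: & and /
as_utf16 = raw.decode("utf-16-le")  # one character, U+2F26 (a Kangxi radical)

print(repr(as_ascii))   # '&/'
print(hex(ord(as_utf16)))  # 0x2f26
```

A tool that only looks at the bytes cannot tell whether it has found a
forbidden '/' or half of a harmless character.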

The above problems make automatically detecting the character encoding
of files in a FAT filesystem at best hard and sometimes impossible.
Therefore, there's no general way to tell whether '&', '/' etc. are
valid characters in a given filename on a FAT filesystem.  Even if there
were a way to work out which characters are allowed, ext2-on-Windows
drivers make it possible to have files with disallowed characters on a
Windows system.

Disallowed characters aren't so much a Windows kernel issue as a
pervasive Windows UI issue.  The exception that proves the rule is Emacs
on Windows.  Emacs being Emacs, it pays little attention to the
conventions of young upstarts like Microsoft, so it can handle files
with funny characters in their names just fine.

Given the above, my suggestion is that there ought to be a tool that
runs identically in Windows and Linux that interactively converts
filenames.  It would ask for an initial encoding, target encoding, and
target path, then recurse through all the directories rooted in that
path, translating filenames as it goes.  Characters that are valid but
tend to
cause headaches could be automatically converted, or the user could be
prompted for a better name.  Most of the actual work in this program can
be done by iconv, although it might be worth having a punycode mode that
minimises incompatibility at the expense of readability.  Finally, I
would suggest that the Windows version be run straight from the Ubuntu
CD, rather than being made available from some website somewhere.  As
well as making the program a little bit easier to find, it makes a great
advert for Linux - it solves the problems that Windows causes.
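
For what it's worth, here is a rough Python sketch of the core of such a
tool, under several assumptions of mine: the function names, the
TROUBLESOME set and the "_" replacement are all invented for
illustration, it uses Python's own codecs rather than iconv, and it only
plans renames instead of performing them.  The interactive prompting is
left out entirely.

```python
import os

# Assumption: an illustrative set of "headache" characters, not an official list.
TROUBLESOME = set('\\/:*?"<>|&')


def convert_name(name, source_encoding, target_encoding, replacement="_"):
    """Re-encode one filename (bytes) and replace troublesome characters.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding;
    an interactive tool would ask the user for help at that point.
    """
    text = name.decode(source_encoding)
    cleaned = "".join(replacement if ch in TROUBLESOME else ch for ch in text)
    return cleaned.encode(target_encoding)


def punycode_name(text):
    """Hypothetical 'punycode mode': an ASCII-only form of a name.

    Names that are already ASCII are returned unchanged.
    """
    if text.isascii():
        return text
    return text.encode("punycode").decode("ascii")


def plan_renames(root, source_encoding, target_encoding, replacement="_"):
    """Yield (old_path, new_path) pairs of bytes; nothing is renamed here."""
    # Walk bottom-up so children are listed before the directory holding them.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in dirnames + filenames:
            new_name = convert_name(name, source_encoding, target_encoding, replacement)
            if new_name != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, new_name)
```

A dry run might look like
`for old, new in plan_renames("/media/windows", "shift_jis", "utf-8"): print(old, new)`,
where the mount point is only an example.  Working on bytes throughout
means the sketch never has to trust the operating system's guess at the
encoding.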

	- Andrew



