Filesystem paths
Martin Pool
mbp at sourcefrog.net
Thu Apr 27 04:39:48 BST 2006
On 27/04/2006, at 12:02 AM, John Arbash Meinel wrote:
>> +The main reason for this is that it's not possible to safely roundtrip a
>> +URL into Unicode and then back into the same URL. The URL standard
>> +gives a way to represent non-ASCII bytes in ASCII (as %-escapes), but
>> +doesn't say how those bytes represent non-ASCII characters. (They're not
>> +guaranteed to be UTF-8 -- that is common but doesn't happen everywhere.)
>> +
>
> So the non-safety of round trips is probably enough for me to accept
> that we need to use URLs.
Me too - it hadn't quite clicked until Robert joined up all the dots
for me.
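To make the round-trip problem concrete, here is a minimal sketch in modern Python 3 (an assumption on my part; the URL is made up). A %-escape only names a byte, so decoding it needs a guess at the encoding, and re-encoding with a different guess produces a different URL:

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical URL whose escaped byte 0xE9 is "e-acute" in Latin-1.
escaped = "http://example.com/caf%E9"

# The URL standard only gets us back to bytes, not to characters.
raw = unquote_to_bytes(escaped)

# Those bytes are not valid UTF-8, so a UTF-8-only decoder fails outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 works, but only because we guessed the encoding.
as_unicode = raw.decode("latin-1")

# Re-encoding the Unicode form as UTF-8 does not give back the original URL.
roundtrip = quote(as_unicode.encode("utf-8"), safe=":/")
```

Here `roundtrip` ends in `%C3%A9` rather than `%E9`, so the URL that comes back no longer names the same resource.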
>
> I'm still concerned about a couple of other edge cases, though.
>
> Specifically, if I manually type in a path, it is going to be a string
> encoded by the local encoding. This includes both local paths, and sftp
> paths (in my mind), and *maybe* http paths.
It's kind of interesting because the stdin/out encoding may be
different from the filesystem encoding.
The (vague) standard on modern unix seems to be that byte strings
stored in the filesystem are to be interpreted as UTF-8 encodings,
but the user has a choice of terminal encodings. (I'm interested to
hear from our users about whether this is actually true for them.)
But even then, stdin/stdout seems to be generally converging on UTF-8.
It's not at all clear to me whether arguments on the command line
should be assumed to be in the filesystem encoding or in the locale
encoding, if the two differ. I suppose they're in the locale encoding.
http://www.gtk.org/gtk-2.0.0-notes.html
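For anyone who wants to report what their own system does, a small sketch in modern Python 3 (again an assumption; `describe_encodings` is just a name I made up) that prints the encodings in question side by side:

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that may disagree with each other."""
    return {
        # Encoding Python uses to convert filenames to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding implied by the user's locale settings.
        "locale": locale.getpreferredencoding(False),
        # Terminal stream encodings; None if a stream is missing.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, value)
```

If the "filesystem" and "locale" values differ, that is exactly the case where it is unclear how command-line arguments should be interpreted.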
File urls are probably actually a very reasonable way to enter paths
that are hard to type - the presence of the file: scheme indicates
the rest is %-escaped. Although at present they're not much use for
files inside a working directory, and they have to be absolute...
> But more likely is that I'm going to cut & paste an http path (either
> from my browser or as a link in an email). And these will be actual
> URLs.
If you copy (with at least Epiphany and Safari) they seem to be
%-escaped, which is quite reasonable.
> Anyway, my concern is that users are going to enter strings which may be
> urls, or may be unicode strings. I really don't think we want to require
> them to always enter urls, because it is a real pain to have to escape
> them when you are just referring to a local file.
> It might be okay to require them to translate the URL into a unicode
> string, but as earlier stated, that is not defined to actually work.
>
> My feeling is that we should treat everything except http as a unicode
> url, and an http:// string as a real url.
>
> Alternatively, we treat plain paths as unicode, and anything that starts
> with foo:// needs to be a real url. I suppose that is the most
> consistent, but it means I can't do:
>
> ssh host
> ls
> copy + paste => sftp://host/path
OK, so how about these rules for handling paths/urls from the user:
If there is no URL scheme, they are filenames. Filenames are
assumed to be encoded in the locale encoding. They can be decoded to
Unicode.
To form the URL for a local file, we encode it into the
filesystem encoding and then escape that.
If there is a scheme, the string is a special "url with unicode".
It can already have escapes. If there are any non-ascii characters,
they are encoded using the locale encoding and then escaped. This
gives a proper (only ascii) url. (I think that rule is close to what
web browsers do with their url field.)
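The rules above could be sketched roughly as follows. This is modern Python 3, not actual bzr code, and `user_input_to_url` and its parameters are names invented for illustration. It assumes the input has already been decoded from the locale encoding into a Unicode string (which is what Python 3 gives you for command-line arguments), and it treats anything starting with `scheme://` as having a scheme:

```python
import locale
import os
import re
import sys
from urllib.parse import quote

# Matches a leading URL scheme such as "http://" or "sftp://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")
# Matches runs of characters outside the ASCII range.
_NON_ASCII_RE = re.compile(r"[^\x00-\x7f]+")


def user_input_to_url(text, locale_encoding=None, fs_encoding=None):
    """Turn a user-supplied path or "url with unicode" into an ASCII URL."""
    locale_encoding = locale_encoding or locale.getpreferredencoding(False)
    fs_encoding = fs_encoding or sys.getfilesystemencoding()

    if _SCHEME_RE.match(text):
        # Existing %-escapes are plain ASCII, so they pass through untouched.
        # Only non-ASCII runs are encoded (locale encoding) and then escaped.
        return _NON_ASCII_RE.sub(
            lambda m: quote(m.group(0).encode(locale_encoding)), text)

    # No scheme: a filename. Make it absolute, encode it into the
    # filesystem encoding, and escape the resulting bytes.
    path = os.path.abspath(text)
    return "file://" + quote(path.encode(fs_encoding))
```

For example, `user_input_to_url("sftp://host/caf\u00e9", locale_encoding="latin-1")` gives `sftp://host/caf%E9`, while the same input under a UTF-8 locale gives `sftp://host/caf%C3%A9`. That difference is the price of guessing, but at least the guess happens once, at the point where the user typed the string.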
>
> I'm not sure what Aaron is defining as "POSIX" interface. What would
> make TestCase.build_tree() a POSIX interface?
I think he meant that it would use os.mkdir, file(), etc directly,
rather than going through the Transport.
--
Martin