Filesystem paths
Jan Hudec
bulb at ucw.cz
Thu Apr 27 06:25:25 BST 2006
On Thu, Apr 27, 2006 at 13:39:48 +1000, Martin Pool wrote:
> On 27/04/2006, at 12:02 AM, John Arbash Meinel wrote:
>
> >>+The main reason for this is that it's not possible to safely
> >>roundtrip a
> >>+URL into Unicode and then back into the same URL. The URL standard
> >>+gives a way to represent non-ASCII bytes in ASCII (as %-escapes),
> >>but
> >>+doesn't say how those bytes represent non-ASCII characters.
> >>(They're not
> >>+guaranteed to be UTF-8 -- that is common but doesn't happen
> >>everywhere.)
> >>+
> >
> >So the non-safety of round trips is probably enough for me to accept
> >that we need to use URLs.
>
> Me too - it hadn't quite clicked until Robert joined up all the dots
> for me.
> >
> >I'm still concerned about a couple of other edge cases, though.
> >
> >Specifically, if I manually type in a path, it is going to be a string
> >encoded by the local encoding. This includes both local paths, and
> >sftp
> >paths (in my mind), and *maybe* http paths.
>
> It's kind of interesting because the stdin/out encoding may be
> different from the filesystem encoding.
>
> The (vague) standard on modern unix seems to be that byte strings
> stored in the filesystem are to be interpreted as UTF-8 encodings,
> but the user has a choice of terminal encodings. (I'm interested to
> hear from our users about whether this is actually true for them.)
> But even then, stdin/stdout seems to be generally converging on UTF-8.
I don't think it's a Unix standard. It's a convention Gnome is trying to
push, and it does not seem to be accepted beyond Gnome. The problem is that
no standard Unix tool supports a filesystem encoding that differs from the
stdin/stdout encoding (except vim, but I'd have to test how it converts what
in that case).

All Linux distributions are moving to UTF-8 for everything, which is quite a
bit easier, since most tools can then keep not needing to know about
character sets.
> It's not at all clear to me whether arguments on the command line
> should be assumed to be in the filesystem encoding or in the locale
> encoding, if the two differ. I suppose they're in the locale encoding.
>
> http://www.gtk.org/gtk-2.0.0-notes.html
>
> File urls are probably actually a very reasonable way to enter paths
> that are hard to type - the presence of the file: scheme indicates
> the rest is %-escaped. Although at present they're not much use for
> files inside a working directory, and they have to be absolute...
We could add an extension:
    file:relative/path
(i.e. NOT starting with //. file://relative/path is wrong, because
'relative' would be parsed as the host, whatever that means for the file:
protocol.)
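To illustrate the host problem, here is a minimal sketch using Python's standard urllib.parse. The choice of module is only for illustration; I'm not claiming this is how bzr parses URLs.

```python
from urllib.parse import urlsplit

# With "//", the first component is parsed as the authority (host).
with_slashes = urlsplit("file://relative/path")

# Without "//", there is no authority; everything after "file:" is the path.
without_slashes = urlsplit("file:relative/path")

print("netloc=%r path=%r" % (with_slashes.netloc, with_slashes.path))
print("netloc=%r path=%r" % (without_slashes.netloc, without_slashes.path))
```

In the first form 'relative' ends up as the host and only '/path' is left as the path, which is why the relative form has to omit the double slash.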
> >But more likely is that I'm going to cut & paste an http path (either
> >from my browser or as a link in an email). And these will be actual
> >URLs.
>
> If you copy (with at least Epiphany and Safari) they seem to be %-
> escaped, which is quite reasonable.
Well, among other things because there is no such thing as an unescaped URL.
If you unescape the whole URL (as opposed to unescaping the path, query
string, options etc. separately), it stops making sense, because some of the
escapes may expand to special characters.
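A small sketch of that failure, again with Python's urllib.parse. The URL and host name are made up for the example.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URL: the path contains an escaped "/" and "?",
# and the query contains an escaped "&".
url = "http://example.com/a%2Fb%3Fc?x=1%26y"

# Unescaping the whole URL turns escaped data into delimiters,
# so parsing the result gives a different path and query.
whole = unquote(url)
reparsed = urlsplit(whole)

# Splitting first and unescaping each component keeps the structure.
parts = urlsplit(url)
path = unquote(parts.path)
query = unquote(parts.query)

print(reparsed.path, reparsed.query)
print(path, query)
```

After unescaping the whole string, the "?" that was data inside the path is taken as the start of the query, so the original components can no longer be recovered.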
> >Anyway, my concern is that users are going to enter strings which
> >may be
> >urls, or may be unicode strings. I really don't think we want to
> >require
> >them to always enter urls, because it is a real pain to have to escape
> >them when you are just referring to a local file.
> >It might be okay to require them to translate the URL into a unicode
> >string, but as earlier stated, that is not defined to actually work.
> >
> >My feeling is that we should treat everything except http as a unicode
> >url, and an http:// string as a real url.
> >
> >Alternatively, we treat plain paths as unicode, and anything that
> >starts
> >with foo:// needs to be a real url. I suppose that is the most
> >consistent, but it means I can't do:
> >
> >ssh host
> >ls
> >copy + paste => sftp://host/path
>
> OK, so how about these rules for handling paths/urls from the user:
>
> If there is no URL scheme, they are filenames. Filenames are
> assumed to be encoded in the locale encoding. They can be decoded to
> Unicode.
>
> To form the URL for a local file, we encode it into the
> filesystem encoding and then escape that.
>
> If there is a scheme, the string is a special "url with unicode".
> It can already have escapes. If there are any non-ascii characters,
> they are encoded using the locale encoding and then escaped. This
> gives a proper (only ascii) url. (I think that rule is close to what
> web browsers do with their url field.)
Yes. I support this.
As a side note, I just tried this in Konqueror, and the way it recodes what
is typed into the URL bar really eludes me.
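The rules quoted above could be sketched roughly as follows. This is only an illustration under several assumptions of mine: the function name user_input_to_url is hypothetical, a scheme is detected by a leading "name://", the input is already a decoded Unicode string (so the "decode from the locale encoding" step has happened), and a filename is an absolute POSIX path.

```python
import locale
import re
import sys
from urllib.parse import quote

# Assumption: a "scheme" means a leading name followed by "://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def user_input_to_url(text, locale_encoding=None, fs_encoding=None):
    """Turn a filename or a "URL with Unicode" into an ASCII-only URL."""
    locale_encoding = locale_encoding or locale.getpreferredencoding(False)
    fs_encoding = fs_encoding or sys.getfilesystemencoding()

    if _SCHEME_RE.match(text):
        # URL with Unicode: leave ASCII (including existing %-escapes)
        # alone; encode non-ASCII characters in the locale encoding and
        # escape the resulting bytes.
        return "".join(
            ch if ord(ch) < 128 else quote(ch.encode(locale_encoding))
            for ch in text
        )

    # Plain filename (assumed absolute): encode into the filesystem
    # encoding, then escape those bytes.
    return "file://" + quote(text.encode(fs_encoding))


print(user_input_to_url("http://host/caf\u00e9%20bar", locale_encoding="latin-1"))
print(user_input_to_url("http://host/caf\u00e9%20bar", locale_encoding="utf-8"))
print(user_input_to_url("/tmp/\u00e9", fs_encoding="utf-8"))
```

Note that the same typed string gives a different URL under a Latin-1 locale than under a UTF-8 one, which is the round-trip problem from the start of the thread seen from the other direction.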
> >I'm not sure what Aaron is defining as "POSIX" interface. What would
> >make TestCase.build_tree() a POSIX interface?
>
> I think he meant that it would use os.mkdir, file(), etc directly,
> rather than going through the Transport.
--
Jan 'Bulb' Hudec <bulb at ucw.cz>