Filesystem paths

Martin Pool mbp at sourcefrog.net
Thu Apr 27 04:39:48 BST 2006


On 27/04/2006, at 12:02 AM, John Arbash Meinel wrote:

>> +The main reason for this is that it's not possible to safely roundtrip a
>> +URL into Unicode and then back into the same URL.  The URL standard
>> +gives a way to represent non-ASCII bytes in ASCII (as %-escapes), but
>> +doesn't say how those bytes represent non-ASCII characters.  (They're not
>> +guaranteed to be UTF-8 -- that is common but doesn't happen everywhere.)
>> +
>
> So the non-safety of round trips is probably enough for me to accept
> that we need to use URLs.

Me too - it hadn't quite clicked until Robert joined up all the dots  
for me.
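
A minimal sketch of the round-trip problem described in the quoted patch text, written for Python 3 (an assumption; the code under discussion predates it). The helper names `escape_bytes` and `roundtrip_via_unicode` are hypothetical and exist only for this illustration. Both URLs name "café", but only the one whose bytes happen to be UTF-8 survives a trip through Unicode when UTF-8 is guessed.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_bytes(raw: bytes) -> str:
    """%-escape raw bytes into an ASCII-only URL path (hypothetical helper)."""
    return quote(raw, safe="/")


def roundtrip_via_unicode(url_path: str, encoding: str = "utf-8") -> str:
    """Unescape to bytes, decode to Unicode with a guessed encoding,
    then encode and escape again (hypothetical helper)."""
    raw = unquote_to_bytes(url_path)
    text = raw.decode(encoding, errors="replace")
    return quote(text.encode(encoding), safe="/")


# The same name, stored by two servers that use different byte encodings.
latin1_url = escape_bytes("caf\u00e9".encode("latin-1"))
utf8_url = escape_bytes("caf\u00e9".encode("utf-8"))

# The UTF-8 URL comes back unchanged; the Latin-1 URL does not, because
# the byte 0xE9 is not valid UTF-8 and is replaced during decoding.
utf8_survives = roundtrip_via_unicode(utf8_url) == utf8_url
latin1_survives = roundtrip_via_unicode(latin1_url) == latin1_url
```

Nothing in the URL itself says which of the two encodings was used, which is why keeping the escaped form around is the safer choice.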
>
> I'm still concerned about a couple of other edge cases, though.
>
> Specifically, if I manually type in a path, it is going to be a string
> encoded by the local encoding. This includes both local paths, and sftp
> paths (in my mind), and *maybe* http paths.

It's kind of interesting because the stdin/out encoding may be  
different from the filesystem encoding.

The (vague) standard on modern unix seems to be that byte strings  
stored in the filesystem are to be interpreted as UTF-8 encodings,  
but the user has a choice of terminal encodings.  (I'm interested to  
hear from our users about whether this is actually true for them.)   
But even then, stdin/stdout seems to be generally converging on UTF-8.

It's not at all clear to me whether arguments on the command line  
should be assumed to be in the filesystem encoding or in the locale  
encoding, if the two differ.  I suppose they're in the locale encoding.

    http://www.gtk.org/gtk-2.0.0-notes.html
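
One way to check whether these encodings actually differ on a given machine is to print them side by side. This is only a sketch in Python 3 (an assumption); `report_encodings` is a hypothetical name, and the values depend entirely on the local environment, so no expected output is shown.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for the filesystem, the
    locale, and the standard streams (hypothetical helper)."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
        # A stream may be missing or replaced, so look the attribute up defensively.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, encoding in report_encodings().items():
        print(f"{name:<12}{encoding}")
```

If the "filesystem" and "locale" rows disagree, that is exactly the case where the interpretation of command-line arguments becomes ambiguous.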

File urls are probably actually a very reasonable way to enter paths  
that are hard to type - the presence of the file: scheme indicates  
the rest is %-escaped.  Although at present they're not much use for  
files inside a working directory, and they have to be absolute...

> But more likely is that I'm going to cut & paste an http path (either
> from my browser or as a link in an email). And these will be actual URLs.

If you copy (with at least Epiphany and Safari) they seem to be
%-escaped, which is quite reasonable.

> Anyway, my concern is that users are going to enter strings which may be
> urls, or may be unicode strings. I really don't think we want to require
> them to always enter urls, because it is a real pain to have to escape
> them when you are just referring to a local file.
> It might be okay to require them to translate the URL into a unicode
> string, but as earlier stated, that is not defined to actually work.
>
> My feeling is that we should treat everything except http as a unicode
> url, and an http:// string as a real url.
>
> Alternatively, we treat plain paths as unicode, and anything that starts
> with foo:// needs to be a real url. I suppose that is the most
> consistent, but it means I can't do:
>
> ssh host
> ls
> copy + paste => sftp://host/path

OK, so how about these rules for handling paths/urls from the user:

  If there is no URL scheme, they are filenames.  Filenames are  
assumed to be encoded in the locale encoding.  They can be decoded to  
Unicode.

  To form the URL for a local file, we encode it into the  
filesystem encoding and then escape that.

  If there is a scheme, the string is a special "url with unicode".   
It can already have escapes.  If there are any non-ascii characters,  
they are encoded using the locale encoding and then escaped.  This  
gives a proper (only ascii) url.  (I think that rule is close to what  
web browsers do with their url field.)
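
The three rules above could be sketched roughly as follows. This is an illustration in Python 3, not existing bzr code: all function names are hypothetical, the `scheme://` test is an assumed way of detecting a scheme, local paths are assumed to be absolute POSIX paths, and the argument is assumed to have already been decoded from the locale encoding into a Unicode string.

```python
import re
from urllib.parse import quote

# Assumption: "has a scheme" means the string starts with "name://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def has_scheme(arg: str) -> bool:
    return bool(_SCHEME_RE.match(arg))


def local_path_to_url(path: str, fs_encoding: str) -> str:
    """Rule 2: encode the filename into the filesystem encoding, then
    %-escape the resulting bytes. Assumes an absolute POSIX path."""
    return "file://" + quote(path.encode(fs_encoding), safe="/")


def normalize_user_url(arg: str, locale_encoding: str) -> str:
    """Rule 3: leave ASCII (including existing %-escapes) untouched and
    escape each non-ASCII character via the locale encoding."""
    parts = []
    for ch in arg:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.append(quote(ch.encode(locale_encoding), safe=""))
    return "".join(parts)


def user_arg_to_url(arg: str, locale_encoding: str, fs_encoding: str) -> str:
    """Rule 1: no scheme means a filename; otherwise a 'url with unicode'."""
    if has_scheme(arg):
        return normalize_user_url(arg, locale_encoding)
    return local_path_to_url(arg, fs_encoding)
```

For example, with a Latin-1 locale and a UTF-8 filesystem, `user_arg_to_url("sftp://host/caf\u00e9", "latin-1", "utf-8")` escapes the accented character as `%E9`, while `user_arg_to_url("/tmp/caf\u00e9", "latin-1", "utf-8")` escapes it as `%C3%A9`. The two encodings are passed in explicitly so that the difference between them stays visible.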
>
> I'm not sure what Aaron is defining as "POSIX" interface. What would
> make TestCase.build_tree() a POSIX interface?

I think he meant that it would use os.mkdir, file(), etc directly,  
rather than going through the Transport.

-- 
Martin






