Filesystem paths

John Arbash Meinel john at arbash-meinel.com
Thu Apr 27 14:22:56 BST 2006


Martin Pool wrote:

...

>> So the non-safety of round trips is probably enough for me to accept
>> that we need to use URLs.
> 
> Me too - it hadn't quite clicked until Robert joined up all the dots for
> me.

So for right now, we will still have issues with paths that don't
properly decode, because we still unescape the *whole* path, and not
just portions of it. But at least we have some of the framework in place.
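
As a rough illustration of the whole-path problem (a sketch only, not the bzr code; the function names are hypothetical and UTF-8 is assumed as the target encoding), compare unescaping the whole path at once with unescaping it one segment at a time:

```python
from urllib.parse import unquote_to_bytes


def unescape_whole(url_path):
    """Unescape the entire path and decode it as UTF-8 in one step.

    A single segment whose bytes are not valid UTF-8 makes the whole
    path fail with UnicodeDecodeError.
    """
    return unquote_to_bytes(url_path).decode("utf-8")


def unescape_segments(url_path):
    """Unescape each '/'-separated segment on its own.

    Segments that do not decode as UTF-8 are left in their escaped
    form, so the remaining segments are still usable.
    """
    result = []
    for segment in url_path.split("/"):
        try:
            result.append(unquote_to_bytes(segment).decode("utf-8"))
        except UnicodeDecodeError:
            result.append(segment)  # keep the %-escaped form
    return result


# The second segment is latin-1 escaped, so it is not valid UTF-8.
mixed = "caf%C3%A9/r%E9sum%E9"
segments = unescape_segments(mixed)
```

With this input, unescape_whole() raises for the entire path, while unescape_segments() still decodes the first segment and leaves only the second one escaped.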

>>
>> I'm still concerned about a couple of other edge cases, though.
>>
>> Specifically, if I manually type in a path, it is going to be a string
>> encoded by the local encoding. This includes both local paths, and sftp
>> paths (in my mind), and *maybe* http paths.
> 
> It's kind of interesting because the stdin/out encoding may be different
> from the filesystem encoding.
> 
> The (vague) standard on modern unix seems to be that byte strings stored
> in the filesystem are to be interpreted as UTF-8 encodings, but the user
> has a choice of terminal encodings.  (I'm interested to hear from our
> users about whether this is actually true for them.)  But even then,
> stdin/stdout seems to be generally converging on UTF-8.
> 
> It's not at all clear to me whether arguments on the command line should
> be assumed to be in the filesystem encoding or in the locale encoding,
> if the two differ.  I suppose they're in the locale encoding.
> 
>    http://www.gtk.org/gtk-2.0.0-notes.html
> 
> File urls are probably actually a very reasonable way to enter paths
> that are hard to type - the presence of the file: scheme indicates the
> rest is %-escaped.  Although at present they're not much use for files
> inside a working directory, and they have to be absolute...

That is what I ended up with. file:// arguments are always URLs and
always absolute paths.

path/to/foo is decoded as normal by bzr (since it is an argument). I
believe bzr uses user_encoding for this.
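
A minimal sketch of that split, assuming the argument arrives as bytes and that anything of the form scheme:// counts as a URL. The function name, the regular expression, and the rejection of file:// URLs with a host component are assumptions made for illustration, not the actual bzr behaviour:

```python
import re

# Hypothetical check: "scheme://" at the start marks a URL.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def interpret_argument(raw, user_encoding="utf-8"):
    """Classify a command-line argument as a plain path or a URL.

    raw is the byte string typed by the user; it is decoded with the
    user's encoding first.  Plain paths stay as decoded Unicode, and
    anything with a scheme is passed through as an (escaped) URL.
    """
    text = raw.decode(user_encoding)
    if not _SCHEME.match(text):
        return ("path", text)
    if text.startswith("file://") and not text.startswith("file:///"):
        # Assumption: file URLs carry no host part and must be absolute.
        raise ValueError("file:// URLs must be absolute: %r" % text)
    return ("url", text)
```

For example, interpret_argument(b"path/to/foo") is treated as a path, while interpret_argument(b"file:///tmp/a%20b") is kept as a URL with its escapes intact.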

> 
>> But more likely is that I'm going to cut & paste an http path (either
>> from my browser or as a link in an email). And these will be actual URLs.
> 
> If you copy (with at least Epiphany and Safari) they seem to be
> %-escaped, which is quite reasonable.

And Firefox.

> 
>> Anyway, my concern is that users are going to enter strings which may be
>> urls, or may be unicode strings. I really don't think we want to require
>> them to always enter urls, because it is a real pain to have to escape
>> them when you are just referring to a local file.
>> It might be okay to require them to translate the URL into a unicode
>> string, but as earlier stated, that is not defined to actually work.
>>
>> My feeling is that we should treat everything except http as a unicode
>> url, and an http:// string as a real url.
>>
>> Alternatively, we treat plain paths as unicode, and anything that starts
>> with foo:// needs to be a real url. I suppose that is the most
>> consistent, but it means I can't do:
>>
>> ssh host
>> ls
>> copy + paste => sftp://host/path
> 
> OK, so how about these rules for handling paths/urls from the user:
> 
>  If there is no URL scheme, they are filenames.  Filenames are assumed
> to be encoded in the locale encoding.  They can be decoded to Unicode.
> 
>  To form the URL for a local file, we encode it into the
> filesystemencoding and then escape that.

I was encoding directly to utf-8. Does it make more sense to have the
URL be filesystemencoded?

What characters are valid in filesystem-encoding that wouldn't be valid
utf-8? I know there are byte-sequences, but if we have already decoded
the path into Unicode, it seems that utf-8 is a safer internal format.

I suppose one issue is that the user would have to translate the name
into unicode and back to utf-8 to be able to type a URL such as
file://latin-1/with128-255chars/path
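
To make the difference concrete, here is a small sketch (the helper name and the example paths are made up, and latin-1 stands in for a non-UTF-8 filesystem encoding) that builds a file URL from an already-decoded Unicode path under either choice:

```python
from urllib.parse import quote


def path_to_file_url(path, encoding="utf-8"):
    """Encode an absolute Unicode path and %-escape the resulting bytes."""
    return "file://" + quote(path.encode(encoding), safe="/")


utf8_url = path_to_file_url("/tmp/caf\u00e9")
latin1_url = path_to_file_url("/tmp/caf\u00e9", encoding="latin-1")

# A name outside latin-1 cannot be encoded in that filesystem
# encoding at all, but always has a UTF-8 form.
try:
    path_to_file_url("/tmp/\u65e5\u672c", encoding="latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

The same path yields two different URLs depending on the encoding chosen, which is why the choice has to be fixed one way or the other. UTF-8 can represent every decoded path, whereas a narrower filesystem encoding cannot.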

> 
>  If there is a scheme, the string is a special "url with unicode".  It
> can already have escapes.  If there are any non-ascii characters, they
> are encoded using the locale encoding and then escaped.  This gives a
> proper (only ascii) url.  (I think that rule is close to what web
> browsers do with their url field.)

From what I've seen the URL field is utf-8 encoded + escaped. Which is
what I was trying to do internally.
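
The "url with unicode" rule could be sketched roughly like this. It is only an illustration: the function name is hypothetical, and the encoding parameter lets either the locale encoding or utf-8 be plugged in:

```python
def normalize_unicode_url(url, encoding="utf-8"):
    """Turn a 'url with unicode' into a pure-ASCII URL.

    ASCII characters, including any existing %-escapes, are passed
    through unchanged.  Each non-ASCII character is encoded with the
    given encoding and every resulting byte is %-escaped.
    """
    pieces = []
    for char in url:
        if ord(char) < 128:
            pieces.append(char)
        else:
            pieces.extend("%%%02X" % byte for byte in char.encode(encoding))
    return "".join(pieces)


typed = "sftp://host/caf\u00e9/a%20b"
as_utf8 = normalize_unicode_url(typed)
as_latin1 = normalize_unicode_url(typed, encoding="latin-1")
```

The existing %20 survives untouched in both results; only the escape chosen for the non-ASCII character differs between the two encodings.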

>>
>> I'm not sure what Aaron is defining as "POSIX" interface. What would
>> make TestCase.build_tree() a POSIX interface?
> 
> I think he meant that it would use os.mkdir, file(), etc directly,
> rather than going through the Transport.
> 
> --Martin


The reason for going through the Transport is that build_tree can then
actually do the build over sftp (which it does in a couple of instances).

I think it gives our Transport stuff a decent workout.

Also, because of how I had to do the URL changes, I think I made "bzr
branch" able to create remote branches (as long as they are in a shared
repo with no working trees).

John
=:->
