Filesystem paths

John Arbash Meinel john at arbash-meinel.com
Wed Apr 26 15:02:32 BST 2006


Martin Pool wrote:
> On 26/04/2006, at 1:53 PM, Aaron Bentley wrote:
> 
>> I don't know Robert's reasons, but the reason I like the transport layer
>> being all-url is because some transports *must* be url-based, and all
>> transports *can* be url-based.  It keeps the layer simple, promotes code
>> reuse, and all that good stuff.
> 
> Here's a patch for the developer documentation which tries to make it
> clear *why* they must be like this.  Is it clear/correct?  (Thanks to
> Robert for helping me get it straight.)
> 


...

> +The main reason for this is that it's not possible to safely roundtrip a
> +URL into Unicode and then back into the same URL.  The URL standard
> +gives a way to represent non-ASCII bytes in ASCII (as %-escapes), but
> +doesn't say how those bytes represent non-ASCII characters.  (They're not
> +guaranteed to be UTF-8 -- that is common but doesn't happen everywhere.)
> +

So the non-safety of round trips is probably enough for me to accept
that we need to use URLs.
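
To make the round-trip problem concrete, here is a minimal sketch using
the Python 3 standard library (urllib.parse), not bzrlib code. The
file name "caf%E9.txt" is a made-up example. It shows one escaped URL
being read under two encoding assumptions and then re-escaped into a
different URL.

```python
from urllib.parse import quote, unquote_to_bytes

# A %-escaped path whose author (by assumption) used Latin-1.
escaped = "caf%E9.txt"

# The URL standard only gets us back to bytes, not to characters.
raw = unquote_to_bytes(escaped)

# Guess 1: the bytes are Latin-1. This decodes cleanly.
as_latin1 = raw.decode("latin-1")

# Guess 2: the bytes are UTF-8. 0xE9 alone is not valid UTF-8, so the
# decoder has to substitute U+FFFD and the original byte is lost.
as_utf8 = raw.decode("utf-8", errors="replace")

# Going back to a URL with a UTF-8 assumption gives a different URL
# from the one we started with.
roundtrip = quote(as_latin1.encode("utf-8"))

print(escaped, "->", roundtrip)
```

Which guess is right depends on who produced the URL, and that is
exactly the information the URL itself does not carry.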

I'm still concerned about a couple of other edge cases, though.

Specifically, if I manually type in a path, it is going to be a string
encoded by the local encoding. This includes both local paths, and sftp
paths (in my mind), and *maybe* http paths.

But more likely is that I'm going to cut & paste an http path (either
from my browser or as a link in an email). And these will be actual URLs.

The sftp case could go either way. I believe the sftp spec says that
paths are byte blobs, just like Unix, and so doesn't enforce anything
except a lack of null characters.

I also don't know how ftp works. I have the feeling that it is an old
enough spec that it could be very implementation-dependent.

In my one test of an ftp server, I got back:
>>> f.list_dir('jfmeinel/')
['\xd8\xac\xd9\x88\xd8\xac\xd9\x8a.txt']

Which is a utf-8 string.

I also tried the sftp transport, and it returned this:
>>> s.list_dir('jfmeinel/')
[u'\u062c\u0648\u062c\u0648.txt']

So right now, sftp returns a unicode string, while ftp returns the raw
byte stream.
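
One way to paper over that difference would be to normalize every
listing to unicode in one place. The following is only a sketch in
Python 3: to_unicode is a hypothetical helper, not an existing bzrlib
function, and the UTF-8 default is an assumption that happened to hold
for the one ftp server tested, not something the protocol promises.

```python
def to_unicode(name, encoding="utf-8"):
    """Return name as text, decoding bytes with an assumed encoding."""
    if isinstance(name, bytes):
        # Raises UnicodeDecodeError if the assumption is wrong, which is
        # preferable to silently producing a garbled name.
        return name.decode(encoding)
    return name


# The raw bytes from the ftp listing shown earlier.
ftp_listing = [b"\xd8\xac\xd9\x88\xd8\xac\xd9\x8a.txt"]
ftp_names = [to_unicode(n) for n in ftp_listing]

# Names that are already text pass through unchanged.
already_text = to_unicode("readme.txt")
```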

Now, ftp fails to actually get that file, because it tries to pass the
non-ascii string to mutter() through a %s format.

Anyway, my concern is that users are going to enter strings which may be
URLs, or may be unicode strings. I really don't think we want to require
them to always enter URLs, because it is a real pain to have to escape
them when you are just referring to a local file.
It might be okay to require them to translate the URL into a unicode
string, but as stated earlier, that is not guaranteed to actually work.

My feeling is that we should treat everything except http as a unicode
url, and an http:// string as a real url.

Alternatively, we treat plain paths as unicode, and anything that starts
with foo:// needs to be a real url. I suppose that is the most
consistent, but it means I can't do:

ssh host
ls
copy + paste => sftp://host/path

Honestly, unicode urls aren't a big deal for *me*. I don't have any need
right now. I will probably need to encode unicode filenames, but I can
post those somewhere with only ascii characters.

Any thoughts?

...

> So LocalTransport.abspath shouldn't be calling osutils.abspath, but
> rather should be manipulating URL objects?  Then we can see that
> 
>   file:///c|/
> 
> has no "up"?

I'm not sure what Aaron meant by local semantics for non-URL paths. As
long as we only allow '/' as the path delimiter, I don't see how path
handling is all that different.
I know there is an issue with ':', and there are some questions about
'isabs' requiring a leading slash versus requiring a leading drive
letter + ':/', though the builtin 'os.path.isabs' would handle that
for us.
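
As a rough illustration of the 'isabs' question, the standard library
exposes both flavours of os.path as posixpath and ntpath, so the two
rules can be compared on any platform. This is a Python 3 sketch;
isabs_by_platform and the sample paths are made up for the example.

```python
import ntpath
import posixpath


def isabs_by_platform(path):
    """Return (posix_result, windows_result) for the same path string."""
    return posixpath.isabs(path), ntpath.isabs(path)


# A drive letter followed by ':/' is absolute only under Windows rules.
drive_path = isabs_by_platform("c:/work")

# A leading slash is absolute under POSIX rules. The Windows answer for
# this case has changed between Python versions, so it is not relied on.
rooted_path = isabs_by_platform("/srv/repo")

# A plain relative path is relative under both sets of rules.
relative_path = isabs_by_platform("work/repo")

print(drive_path, rooted_path[0], relative_path)
```

os.path.isabs simply picks whichever of the two matches the running
platform, so it answers the question for local paths but not for the
path portion of a URL.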

> 
>> Since users will rarely pass in URL for filesystem paths, we should have
>> a function that converts user paths into URLs (if they're not already).
>>  Quite possibly get_transport should do that.
> 
> Yes.

I would tend to agree, since I think it would be up to the transport to
decide whether the passed-in paths were URLs or unicode paths.
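
A minimal sketch of such a conversion function, following the "anything
that starts with foo:// is a real URL, everything else is a unicode
path" rule. It is Python 3 and purely illustrative: user_path_to_url is
a hypothetical name (not what get_transport does), it assumes an
absolute POSIX-style path, and it assumes UTF-8 when escaping.

```python
import re
from urllib.parse import quote

# Matches a leading "scheme://" such as http:// or sftp://.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def user_path_to_url(text):
    """Turn user input into a URL, leaving real URLs untouched."""
    if _SCHEME_RE.match(text):
        # Already a URL: assume the user escaped it correctly.
        return text
    # Plain path: escape its UTF-8 bytes (an assumption) and keep '/'.
    return "file://" + quote(text.encode("utf-8"))


local_url = user_path_to_url("/tmp/\u062c.txt")
remote_url = user_path_to_url("sftp://host/path")
```

The cost of this rule is the one mentioned above: a path pasted after
sftp://host/ is taken as already escaped, even if it came straight
from 'ls'.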

> 
>> OTOH, I don't think it's appropriate to be using transports to access
>> working trees, and since that's the bug you encountered, I suggest
>> that's what we should fix-- build_tree should either be implemented in
>> terms of POSIX, or it should translate paths to urls before using them
>> with Transport.
> 
> Can you tell me more about why it's not appropriate?  Is it because
> Transports should focus on supporting just what is needed for control
> file access?
> 
> --Martin

I'm not sure what Aaron is defining as a "POSIX" interface. What would
make TestCase.build_tree() a POSIX interface?

John
=:->
