"Using Saved Location: foo"
John Arbash Meinel
john at arbash-meinel.com
Mon May 1 16:21:11 BST 2006
Martin Pool wrote:
> On 28/04/2006, at 11:42 PM, John Arbash Meinel wrote:
>>
>> I chose to try and split it up per hierarchy component. So my algorithm
>> is something like:
>>
>>   split on '/'
>>     split chunk on '%'
>>     expand safe escapes (all but the unsafe ones)
>>     test_chunk = join unescaped hunks
>>     try:
>>       chunk = test_chunk.decode('utf-8')
>>     except UnicodeDecodeError:
>>       # leave chunk alone
>>
>>   join chunks with '/'
>>
>> This gives the option that if you have a non-utf-8 portion of your url,
>> the rest of the URL is still decoded properly.
>>
>> I did this, because files under bzr control should have utf-8
>> representation, but we don't control above that.
>
> OK that looks good.
>
> We might eventually want IDNA display of domain names to be handled
> separately, but that can be done later.
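
For reference, the per-component decoding quoted above could be sketched roughly as follows. This is an illustration in Python 3, not the code in bzr: the name `url_for_display` and the contents of `_UNSAFE` are assumptions made up for the example.

```python
import string

# Assumption: characters that stay percent-escaped even for display.
_UNSAFE = frozenset(b'%/?#')
_HEX = frozenset(string.hexdigits)


def url_for_display(url):
    """Decode each '/'-separated chunk on its own (hypothetical helper)."""
    chunks = url.split('/')
    for i, chunk in enumerate(chunks):
        hunks = chunk.split('%')
        pieces = [hunks[0].encode('utf-8')]
        for hunk in hunks[1:]:
            digits, rest = hunk[:2], hunk[2:]
            if len(digits) == 2 and set(digits) <= _HEX:
                value = int(digits, 16)
                if value not in _UNSAFE:
                    # Safe escape: expand it to the raw byte.
                    pieces.append(bytes([value]) + rest.encode('utf-8'))
                    continue
            # Unsafe or malformed escape: keep it exactly as written.
            pieces.append(b'%' + hunk.encode('utf-8'))
        try:
            chunks[i] = b''.join(pieces).decode('utf-8')
        except UnicodeDecodeError:
            pass  # leave this chunk alone; the other chunks are still decoded
    return '/'.join(chunks)
```

With this sketch, a chunk that is not valid utf-8 (such as `caf%E9`) stays escaped while its neighbours are decoded.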
I've been thinking about it. Is there a reason not to allow Unicode
pseudo-URLs?

Basically, I'm thinking of using the inverse of "urlfordisplay()". If the
input comes in as unicode, escape all non-ASCII characters and turn it
into a URL. You can still tell a URL from a non-URL by the presence of
'://'. If you want to be extra cautious, you could require that '://' not
be preceded by any other slash (though really the accidental case is
someone typing C://foo, which that extra check doesn't solve either).
It seems that this would let people type:
sftp://host/path/to/日本人
Instead of having to type:
sftp://host/path/to/%E6%97%A5%E6%9C%AC%E4%BA%BA
Maybe a more obvious one would be:
sftp://host/path/to/bzr/bågfors
versus
sftp://host/path/to/bzr/b%C3%A5gfors
I think if the "url" is unicode, there is still no ambiguity. We know
whether a path is a URL because of the prefix. As long as we only escape
Unicode that cannot be converted to ascii, I think we would be okay.
So something like:
  if unicode:
    split on '/':
      try:
        chunk = chunk.encode('ascii')
      except UnicodeEncodeError:
        chunk = urlescape(chunk)
    join chunks with '/'
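
The same idea as a runnable Python 3 sketch. Here `urllib.parse.quote` stands in for the `urlescape` placeholder, and `unicode_to_url` is a made-up name. It assumes utf-8 is the encoding we want on the wire.

```python
from urllib.parse import quote


def unicode_to_url(text):
    """Percent-escape only the chunks that cannot be encoded as ASCII."""
    chunks = text.split('/')
    for i, chunk in enumerate(chunks):
        try:
            chunk.encode('ascii')
        except UnicodeEncodeError:
            # quote() encodes str input as utf-8 before escaping;
            # safe='' means every reserved character in the chunk is escaped.
            chunks[i] = quote(chunk, safe='')
    return '/'.join(chunks)
```

Pure-ASCII chunks, including ones that already carry %XX escapes, pass through untouched. A non-ASCII host name would also be percent-escaped by this sketch rather than handled as IDNA.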
We would still have a problem if somebody mixed the two styles, like doing:
sftp://host/path/to/日本%E4%BA%BA
so that one specific chunk has both unicode and URL escaping. But it
would succeed if the two styles were in separate chunks:
sftp://host/path/to/b%C3%A5gfors/日本人
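
To make the mixed case concrete, here is a small Python 3 check, again assuming `urllib.parse.quote` as the escaping function. Escaping the whole chunk also escapes the '%' signs that were already there:

```python
from urllib.parse import quote, unquote

mixed = '日本%E4%BA%BA'          # unicode and URL escaping in one chunk
escaped = quote(mixed, safe='')

# Each existing '%' becomes '%25', so the old escapes are now literal text.
assert escaped == '%E6%97%A5%E6%9C%AC%25E4%25BA%25BA'

# Decoding once gives back the mixed text, not 日本人.
assert unquote(escaped) == mixed
assert unquote(escaped) != '日本人'
```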
>
>>> Can I suggest having a:
>>> class url(string):
>>> ...
>>> ?
>>
>> *shudder*
>>
>> I think I can see your point. But I've seen mostly pain when people try
>> to inherit from string.
>
> Perhaps URL(object), but with a __str__ method that allows it to be used
> in place?
>
>>> Then various places in the code can assert that they really have a
>>> url (I'd include relative-url-reference) in it.
>>
>> The other problem is that user input isn't a URL (yet). It may actually
>> be a URL, but before it hits Transport, it is just a Unicode string.
>
> I think we would want separate factory methods: one comes from a
> strictly ascii, properly encoded URL, and raises if it's not correct.
> The other accepts Unicode input from the user and applies the heuristics
> in this thread, raising only if it's impossible to work out what they
> mean. And yet another that turns local paths into file urls.
Sounds reasonable to me.
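
A rough Python 3 sketch of what those three factory methods might look like. The class and method names (`URL`, `from_url`, `from_user_input`, `from_local_path`) are invented for illustration and are not an existing bzrlib API. The "heuristics" here are reduced to per-chunk escaping.

```python
import pathlib
import re
from urllib.parse import quote


class URL:
    """Hypothetical wrapper with a __str__ so it can be used in place."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    @classmethod
    def from_url(cls, text):
        """Strict: ASCII only, and every '%' must start a %XX escape."""
        if not text.isascii() or '://' not in text:
            raise ValueError('not a properly encoded URL: %r' % (text,))
        if re.search(r'%(?![0-9A-Fa-f]{2})', text):
            raise ValueError('bad percent escape in %r' % (text,))
        return cls(text)

    @classmethod
    def from_user_input(cls, text):
        """Lenient: escape non-ASCII chunks, raise only if there is no '://'."""
        if '://' not in text:
            raise ValueError('cannot interpret %r as a URL' % (text,))
        chunks = [c if c.isascii() else quote(c, safe='')
                  for c in text.split('/')]
        return cls('/'.join(chunks))

    @classmethod
    def from_local_path(cls, path):
        """Turn a local filesystem path into a file:// URL."""
        return cls(pathlib.Path(path).absolute().as_uri())
```

For example, `str(URL.from_user_input('sftp://host/path/to/bzr/bågfors'))` gives the escaped form, while `URL.from_url` rejects the same input because it is not ASCII.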
>
> Robert has expressed interest in writing a properly standards-compliant
> URL module for Python -- apparently none of the existing ones are
> strictly correct.
>
>> Maybe if we inherited from string, but didn't actually try to add any
>> members, it might be okay as a debugging tool. It seems like it would
>> make the Transport api harder to use. Since now it not only expects
>> valid url fragments, but it requires them to be "url()" instances.
>
> True - so would methods accept both, or would we take strings for
> fragments, but whole URLs when they are needed?
>
> --Martin
The only advantage of a URL class to me at the moment is that it is a
separate type, so you can put "assert isinstance(url, URL)" in all of
the Transport methods.

I suppose you could create a URL class which would keep the protocol
separate from the path, etc. But I think paths as strings are the best
api. They are just easy to use, especially in an interactive
interpreter when you are trying to debug something.
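
To illustrate the trade-off, a small Python 3 sketch. `URL`, `get_typed` and `get_plain` are hypothetical names, not the actual Transport api.

```python
class URL:
    """Hypothetical class keeping the protocol separate from the path."""

    def __init__(self, protocol, path):
        self.protocol = protocol
        self.path = path

    def __str__(self):
        return '%s://%s' % (self.protocol, self.path)


def get_typed(url):
    # The type check catches a caller who passes an unchecked string.
    assert isinstance(url, URL), 'expected URL, got %r' % (url,)
    return str(url)


def get_plain(url):
    # Plain strings: nothing to construct, but only a weak sanity check.
    assert '://' in url
    return url
```

The typed version needs `get_typed(URL('sftp', 'host/path'))`, where the string version is just `get_plain('sftp://host/path')`, which is the convenience I mean at the interactive prompt.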
John
=:->