"Using Saved Location: foo"

John Arbash Meinel john at arbash-meinel.com
Mon May 1 16:21:11 BST 2006


Martin Pool wrote:
> On 28/04/2006, at 11:42 PM, John Arbash Meinel wrote:
>>
>> I chose to try and split it up per hierarchy component. So my algorithm
>> is something like:
>>
>> split on '/'
>>     split chunk on '%'
>>         expand safe escapes (all but the unsafe ones)
>>     test_chunk = join unescaped hunks
>>     try
>>         chunk = test_chunk.decode(utf-8)
>>     except UnicodeDecodeError:
>>         # leave chunk alone
>>
>> join chunks with '/'
>>
>> This gives the option that if you have a non-utf-8 portion of your url,
>> the rest of the URL is still decoded properly.
>>
>> I did this, because files under bzr control should have utf-8
>> representation, but we don't control above that.
> 
> OK that looks good.
> 
> We might eventually want IDNA display of domain names to be handled
> separately, but that can be done later.
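
For reference, here is a rough Python 3 sketch of the per-component
decoding quoted above. The names (url_for_display, _UNSAFE) and the exact
set of characters kept escaped are assumptions made for illustration, not
the code that is actually in bzr.

```python
import re

# Assumption: these ASCII bytes stay percent-escaped, because unescaping
# them would change how the URL parses or reads.
_UNSAFE = frozenset(b"%/?# ")
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def _expand_safe(match):
    """Unescape one %XX sequence unless it is a control or unsafe byte."""
    value = int(match.group(1), 16)
    if value < 0x20 or value == 0x7F or value in _UNSAFE:
        return match.group(0)  # keep the escape exactly as written
    return bytes([value])


def url_for_display(url):
    """Decode each '/'-separated chunk of an ASCII URL independently."""
    chunks = url.split("/")
    for i, chunk in enumerate(chunks):
        raw = _ESCAPE.sub(_expand_safe, chunk.encode("ascii"))
        try:
            chunks[i] = raw.decode("utf-8")
        except UnicodeDecodeError:
            pass  # not valid UTF-8: leave this chunk alone
    return "/".join(chunks)
```

Because each chunk is decoded on its own, one non-utf-8 component does
not stop the rest of the URL from being shown in readable form.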

I've been thinking about it. Is there a reason not to allow Unicode
pseudo-URLs?

Basically, I'm thinking of using the inverse of "urlfordisplay()". If the
input comes in as unicode, escape all the non-ascii characters and turn
it into a URL. You can still tell URL from not-URL by the presence of
'://'. If you want to be extra cautious you could require that the '://'
is not preceded by any other slash (though really the accidental case is
someone typing C://foo, which that extra check doesn't solve).

It seems that this would let people type:

sftp://host/path/to/日本人

Instead of having to type:
sftp://host/path/to/%E6%97%A5%E6%9C%AC%E4%BA%BA

Maybe a more obvious one would be:
sftp://host/path/to/bzr/bågfors

versus
sftp://host/path/to/bzr/b%C3%A5gfors

I think if the "url" is unicode, there is still no ambiguity. We know
that a path is a URL because of the prefix. As long as we only escape
characters that cannot be encoded as ascii, I think we would be okay.

So something like:
if unicode:
  split on '/':
    try:
      chunk = chunk.encode('ascii')
    except UnicodeEncodeError:
      chunk = urlescape(chunk)
  join chunks with '/'

We would still have a problem if somebody mixed the two forms, like:
sftp://host/path/to/日本%E4%BA%BA

so that one specific chunk has both raw unicode and URL escaping. But it
would succeed if the two forms were in separate chunks:
sftp://host/path/to/b%C3%A5gfors/日本人
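
A minimal Python 3 sketch of that pseudocode, assuming the input is a
str (i.e. already unicode). The name unicode_to_url is made up for
illustration; quote() is the standard-library urllib.parse.quote.

```python
from urllib.parse import quote


def unicode_to_url(text):
    """Escape only the '/'-separated chunks that are not pure ascii."""
    chunks = text.split("/")
    for i, chunk in enumerate(chunks):
        try:
            chunk.encode("ascii")
        except UnicodeEncodeError:
            # quote() encodes as UTF-8 and percent-escapes the result.
            # It also escapes any '%' already in the chunk, which is the
            # mixed-chunk problem described above.
            chunks[i] = quote(chunk, safe="")
    return "/".join(chunks)
```

With this sketch a chunk that is already escaped and pure ascii passes
through untouched, while a mixed chunk such as 日本%E4%BA%BA gets its '%'
escaped a second time (as %25).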

> 
>>> Can I suggest having a:
>>> class url(string):
>>>     ...
>>> ?
>>
>> *shudder*
>>
>> I think I can see your point. But I've seen mostly pain when people try
>> to inherit from string.
> 
> Perhaps URL(object), but with a __str__ method that allows it to be used
> in place?
> 
>>> Then various places in the code can assert that they really have a
>>> url (I'd include relative-url-reference) in it.
>>
>> The other problem is that user input isn't a URL (yet). It may actually
>> be a URL, but before it hits Transport, it is just a Unicode string.
> 
> I think we would want separate factory methods: one comes from a
> strictly ascii, properly encoded URL, and raises if it's not correct. 
> The other accepts Unicode input from the user and applies the heuristics
> in this thread, raising only if it's impossible to work out what they
> mean.  And yet another that turns local paths into file urls.

Sounds reasonable to me.
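
To make the three entry points concrete, here is one possible shape in
Python 3. Everything in it is hypothetical: the class name, the method
names, and the validation rules are assumptions for the sake of
discussion, not an existing bzr api.

```python
import os
import re
from pathlib import Path
from urllib.parse import quote

# A '%' that is not followed by two hex digits is a malformed escape.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


class URL(object):
    """Thin wrapper; str(url) gives back the plain ascii URL."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    @classmethod
    def from_url(cls, text):
        """Strict: accept only ascii, properly escaped URLs."""
        if not text.isascii() or "://" not in text or _BAD_ESCAPE.search(text):
            raise ValueError("not a properly encoded URL: %r" % (text,))
        return cls(text)

    @classmethod
    def from_user_input(cls, text):
        """Lenient: apply the heuristics discussed in this thread."""
        head, sep, _ = text.partition("://")
        if sep and head and "/" not in head:
            # Looks like a URL: escape only the non-ascii chunks.
            chunks = [c if c.isascii() else quote(c, safe="")
                      for c in text.split("/")]
            return cls.from_url("/".join(chunks))
        return cls.from_local_path(text)

    @classmethod
    def from_local_path(cls, path):
        """Turn a local filesystem path into a file:// URL."""
        return cls(Path(os.path.abspath(path)).as_uri())
```

A Transport method could then check "isinstance(url, URL)" and still get
the plain string back with str(url).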

> 
> Robert has expressed interest in writing a properly standards-compliant
> URL module for Python -- apparently none of the existing ones are
> strictly correct.
> 
>> Maybe if we inherited from string, but didn't actually try to add any
>> members, it might be okay as a debugging tool. It seems like it would
>> make the Transport api harder to use. Since now it not only expects
>> valid url fragments, but it requires them to be "url()" instances.
> 
> True - so would methods accept both, or would we take strings for
> fragments, but whole URLs when they are needed?
> 
> --Martin

The only advantage I see in a URL class at the moment is that it is a
separate type, so you can put "assert isinstance(url, URL)" in all of
the Transport methods.
I suppose you could create a URL class which would keep the protocol
separate from the path, etc. But I think paths as strings are the best
api. They are just easy to use, especially in an interactive
interpreter when you are trying to debug something.

John
=:->

