"Using Saved Location: foo"

Martin Pool mbp at sourcefrog.net
Tue May 2 04:24:45 BST 2006


On 02/05/2006, at 1:21 AM, John Arbash Meinel wrote:

> Martin Pool wrote:
>> On 28/04/2006, at 11:42 PM, John Arbash Meinel wrote:
>>>
>>> I chose to try and split it up per hierarchy component. So my  
>>> algorithm
>>> is something like:
>>>
>>> split on '/'
>>>     split chunk on '%'
>>>         expand safe escapes (all but the unsafe ones)
>>>     test_chunk = join unescaped hunks
>>>     try
>>>         chunk = test_chunk.decode(utf-8)
>>>     except UnicodeDecodeError:
>>>         # leave chunk alone
>>>
>>> join chunks with '/'
>>>
>>> This gives the option that if you have a non-utf-8 portion of  
>>> your url,
>>> the rest of the URL is still decoded properly.
>>>
>>> I did this, because files under bzr control should have utf-8
>>> representation, but we don't control above that.
>>
>> OK that looks good.
>>
>> We might eventually want IDNA display of domain names to be handled
>> separately, but that can be done later.
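
(For the record, that per-component decode could be sketched like this
-- modern Python 3 shown, since the code above is pseudocode; the
UNSAFE set and the helper names are my own guesses, not bzrlib code,
and well-formed %XX escapes are assumed:)

```python
# Escapes we deliberately leave escaped even in the display form,
# because expanding them would change the URL's structure.
UNSAFE = frozenset(b'%/\\#?')

def unescape_chunk_for_display(chunk):
    """Unescape one path component; fall back to the escaped form
    if the expanded bytes are not valid UTF-8."""
    pieces = chunk.split('%')
    out = [pieces[0].encode('ascii')]
    for piece in pieces[1:]:
        byte = int(piece[:2], 16)        # assumes well-formed %XX
        if byte in UNSAFE:
            out.append(b'%' + piece.encode('ascii'))   # keep escaped
        else:
            out.append(bytes([byte]) + piece[2:].encode('ascii'))
    try:
        return b''.join(out).decode('utf-8')
    except UnicodeDecodeError:
        return chunk    # leave this component alone

def unescape_for_display(path):
    """Decode each '/'-separated component independently."""
    return '/'.join(unescape_chunk_for_display(c)
                    for c in path.split('/'))
```

Anything that fails to decode as UTF-8 just stays in its escaped
form, so the rest of the URL still displays properly.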
>
> I've been thinking about it. And is there a reason not to allow  
> Unicode
> pseudo-URLs?

I think we should accept them.
>
> Basically, I'm thinking to use the inverse of "urlfordisplay()". If it
> comes in as unicode, escape all unicode characters and turn it into a
> URL. You can still detect URL/not URL because of the presence of  
> '://'.
> If you want to be extra cautious you could say '://' not preceded  
> by any
> other slash. (though really the accidental case is someone typing
> C://foo, which isn't solved by the simplification).
>
> It seems that this would let people type:
>
> sftp://host/path/to/日本人
>
> Instead of having to type:
> sftp://host/path/to/%E6%97%A5%E6%9C%AC%E4%BA%BA
>
> Maybe a more obvious one would be:
> sftp://host/path/to/bzr/bågfors

Yes, I agree -- that's what I was trying to communicate before.

> versus
> sftp://host/path/to/bzr/b%C3%A5gfors
>
> I think if the "url" is unicode, there still is no ambiguity. We  
> know if
> a path is a URL because of the prefix. As long as we only escape  
> Unicode
> that cannot be converted to ascii, I think we would be okay.
>
> So something like:
> if unicode:
>   split on '/':
>     try:
>       chunk = chunk.encode('ascii')
>     except UnicodeEncodeError:
>       chunk = urlescape(chunk)

Actually it's going to be

   chunk = urlescape(chunk.encode('utf-8'))

There is still an ambiguity here - we are *assuming* that their  
server treats URLs as being in UTF-8, but we don't know that.  It  
could very well use 8859-1 or any other national language encoding.   
Such servers will have URLs that can't be correctly roundtripped to  
the display form and back.

I think it is a reasonable assumption for most cases.  In places  
where this does not work, people can give the fully escaped form and  
it will be OK.
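
To make that direction concrete, here is a rough Python 3 sketch of
turning a unicode pseudo-URL into a fully escaped URL, one component
at a time as John suggested.  urllib.parse.quote stands in for our
urlescape, and the function name is made up:

```python
import urllib.parse

def escape_for_transport(pseudo_url):
    """Turn a unicode pseudo-URL into a real, fully escaped URL,
    assuming the server interprets path bytes as UTF-8."""
    scheme, rest = pseudo_url.split('://', 1)
    chunks = []
    for chunk in rest.split('/'):
        try:
            chunk.encode('ascii')
        except UnicodeEncodeError:
            # Non-ascii component: escape its UTF-8 bytes.
            chunk = urllib.parse.quote(chunk.encode('utf-8'))
        chunks.append(chunk)
    return scheme + '://' + '/'.join(chunks)
```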

Because of this ambiguity I think we should use these unicode pseudo- 
URLs only in defined layers that talk to the user.  We shouldn't use  
them internally or store them in files.

> We would still have a problem if somebody mixed their stuff, like  
> doing:
> sftp://host/path/to/日本%E4%BA%BA

It seems like we ought to be able to handle that.  Each Kanji  
character can be individually translated to UTF-8, and then to url  
escapes.

I don't understand why the conversion has to be done one path  
component at a time.  To translate from pseudo-URLs to real URLs I'd  
have something like this:

   r = ''
   for c in pseudo_url:
     if c in url_safe_characters:
       r += c
     else:
       r += urlescape(c.encode('utf-8'))

This also takes care of URLs containing ascii characters that must be  
escaped, such as space.
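
Spelled out as runnable (modern Python 3) code -- the exact contents
of url_safe_characters are an assumption on my part, and note that
'%' has to count as safe so existing escapes pass through untouched:

```python
import string
import urllib.parse

# Assumed set of characters allowed to appear literally in a URL.
URL_SAFE_CHARACTERS = frozenset(
    string.ascii_letters + string.digits + "_.-~!$&'()*+,;=:@/%#?")

def pseudo_url_to_url(pseudo_url):
    """Translate a unicode pseudo-URL to a real URL, one character
    at a time."""
    r = ''
    for c in pseudo_url:
        if c in URL_SAFE_CHARACTERS:
            r += c
        else:
            # Covers both non-ascii characters and ascii characters
            # that must be escaped, such as space.
            r += urllib.parse.quote(c.encode('utf-8'), safe='')
    return r
```

This handles the mixed case above too: the Kanji characters get
escaped individually, while the already-escaped %E4%BA%BA passes
through unchanged.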

-- 
Martin