"Using Saved Location: foo"
Martin Pool
mbp at sourcefrog.net
Tue May 2 04:24:45 BST 2006
On 02/05/2006, at 1:21 AM, John Arbash Meinel wrote:
> Martin Pool wrote:
>> On 28/04/2006, at 11:42 PM, John Arbash Meinel wrote:
>>>
>>> I chose to try to split it up per hierarchy component. So my
>>> algorithm is something like:
>>>
>>> split on '/'
>>> split chunk on '%'
>>> expand safe escapes (all but the unsafe ones)
>>> test_chunk = join unescaped hunks
>>> try:
>>>     chunk = test_chunk.decode('utf-8')
>>> except UnicodeDecodeError:
>>>     pass  # leave chunk alone
>>>
>>> join chunks with '/'
>>>
>>> This way, if you have a non-UTF-8 portion of your URL, the rest
>>> of the URL is still decoded properly.
>>>
>>> I did this, because files under bzr control should have utf-8
>>> representation, but we don't control above that.
>>
>> OK that looks good.
>>
>> We might eventually want IDNA display of domain names to be handled
>> separately, but that can be done later.
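The per-component decode described above might be sketched like this in today's Python (the thread predates Python 3; the name `unescape_for_display` is mine, not bzr's, and the real algorithm also keeps certain unsafe escapes such as %2F escaped, a refinement omitted here):

```python
import urllib.parse

def unescape_for_display(url):
    """Decode each path component independently; a component that is
    not valid UTF-8 once unescaped is left in its escaped form."""
    out = []
    for chunk in url.split('/'):
        try:
            chunk = urllib.parse.unquote(chunk, errors='strict')
        except UnicodeDecodeError:
            pass  # leave this component alone
        out.append(chunk)
    return '/'.join(out)
```

So a bad component stays escaped while its neighbours still decode.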
>
> I've been thinking about it. And is there a reason not to allow
> Unicode
> pseudo-URLs?
I think we should accept them.
>
> Basically, I'm thinking to use the inverse of "urlfordisplay()". If it
> comes in as unicode, escape all unicode characters and turn it into a
> URL. You can still detect URL/not URL because of the presence of
> '://'.
> If you want to be extra cautious you could say '://' not preceded
> by any
> other slash. (though really the accidental case is someone typing
> C://foo, which isn't solved by the simplification).
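A detector along those lines could be as simple as this (a sketch, not bzr's actual check):

```python
def looks_like_url(s):
    """Treat s as a URL only if '://' appears with no slash before it,
    so ordinary relative paths containing '://' are not mistaken."""
    idx = s.find('://')
    if idx == -1:
        return False
    return '/' not in s[:idx] and '\\' not in s[:idx]
```

As noted, the accidental `C://foo` case still slips through this check.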
>
> It seems that this would let people type:
>
> sftp://host/path/to/日本人
>
> Instead of having to type:
> sftp://host/path/to/%E6%97%A5%E6%9C%AC%E4%BA%BA
>
> Maybe a more obvious one would be:
> sftp://host/path/to/bzr/bågfors
Yes, I agree -- that's what I was trying to communicate before.
> versus
> sftp://host/path/to/bzr/b%C3%A5gfors
>
> I think if the "url" is unicode, there still is no ambiguity. We
> know if
> a path is a URL because of the prefix. As long as we only escape
> Unicode
> that cannot be converted to ascii, I think we would be okay.
>
> So something like:
> if unicode:
>     split on '/':
>         try:
>             chunk = chunk.encode('ascii')
>         except UnicodeEncodeError:
>             chunk = urlescape(chunk)
Actually it's going to be:
    chunk = urlescape(chunk.encode('utf-8'))
There is still an ambiguity here - we are *assuming* that their
server treats URLs as being in UTF-8, but we don't know that. It
could very well use 8859-1 or any other national language encoding.
Such servers will have URLs that can't be correctly roundtripped to
the display form and back.
I think it is a reasonable assumption for most cases. In places
where this does not work, people can give the fully escaped form and
it will be OK.
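That escaping step could look roughly like this in today's Python (the name `escape_for_url` is mine; the code assumes the server interprets URL bytes as UTF-8, which is exactly the assumption discussed above):

```python
import urllib.parse

def escape_for_url(pseudo_url):
    """Escape only components that are not plain ASCII, assuming
    UTF-8 on the server side.  Components the user already escaped
    by hand are ASCII and so pass through untouched."""
    out = []
    for chunk in pseudo_url.split('/'):
        try:
            chunk.encode('ascii')
        except UnicodeEncodeError:
            chunk = urllib.parse.quote(chunk.encode('utf-8'))
        out.append(chunk)
    return '/'.join(out)
```

Note that the fully escaped form round-trips unchanged, which is what makes it a safe fallback for non-UTF-8 servers.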
Because of this ambiguity I think we should use these unicode pseudo-
urls only in defined layers that talk to the user. We shouldn't use
them internally or store them in files.
> We would still have a problem if somebody mixed their stuff, like
> doing:
> sftp://host/path/to/日本%E4%BA%BA
It seems like we ought to be able to handle that. Each Kanji
character can be individually translated to UTF-8, and then to url
escapes.
I don't understand why the conversion has to be done one path
component at a time. To translate from pseudo-URLs to real URLs I'd
have something like this:
r = ''
for c in pseudo_url:
    if c in url_safe_characters:
        r += c
    else:
        if isinstance(c, unicode):
            r += urlescape(c.encode('utf-8'))
This also takes care of URLs containing ascii characters that must be
escaped, such as space.
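Filled out, that loop might look like this in today's Python (`URL_SAFE` is my approximation of the safe set; keeping '%' in it is what lets already-escaped runs like %E4%BA%BA pass through untouched, so the mixed example above also works):

```python
import urllib.parse

# Rough safe set for illustration only; bzr's real list may differ.
URL_SAFE = set("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "0123456789" "-._~/%:")

def pseudo_url_to_url(pseudo_url):
    """Escape each unsafe character independently, via UTF-8."""
    r = ''
    for c in pseudo_url:
        if c in URL_SAFE:
            r += c
        else:
            r += urllib.parse.quote(c.encode('utf-8'))
    return r
```

Since space is not in the safe set, it is escaped to %20 along the way, as claimed.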
--
Martin