Unicode URLs (was Re: "Using Saved Location: foo")

John Arbash Meinel john at arbash-meinel.com
Tue May 2 15:02:53 BST 2006


Martin Pool wrote:
> On  2 May 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> Martin Pool wrote:
>>> On 02/05/2006, at 3:18 PM, Jan Hudec wrote:
>>>> Well, with a twist. The input url is actually 'encoded', so urlescape
>>>> must not escape % in this case.
>>> Right - actually I think doing it chunk-by-chunk is just going to
>>> complicate things.  I think the other algorithm I quoted would be better
>>> for handling these pseudo-URLs, since it doesn't touch things that might
>>> already be escaped:
>>>
>>>   r = ''
>>>   for c in pseudo_url:
>>>     if c in url_safe_characters:
>>>       r += c
>>>     else:
>>>       if isinstance(c, unicode):
>>>         c = c.encode('utf-8')
>>>       r += urlescape(c)
>>>
>>> --Martin
>> Well, I think we need to decode for display by chunk, because if a chunk
>> doesn't decode by 'utf-8' then I think we should leave it escaped.
> 
> The question is really: if one path component can't be decoded as UTF-8,
> should you try to decode the others that way?  There are cases where
> that might be right(1), but others where it's wrong(2):
> 
>  (1) a server where user home directories are named in 8859-1, but
>      one user chooses to name their files in utf-8
> 
>  (2) a server where all paths are some codepage, but it just
>      coincidentally happens that one path component can be decoded
>      as garbage but valid UTF-8; it would be better not to decode 
>      at all

The specific case I am thinking about is a server that does not use
utf-8 for its own paths, while everything next to '.bzr' is still going
to be utf-8.

So far, all of the bzr control files are ascii. And I think we're
planning on keeping that (considering the lengths we're going to with
url escaping).
However, bzr supports stuff like:

bzr log http://host/path/to/branch/file/in/branch

So while the server would decode:
http://host/path/to/branch/

bzr itself is decoding "file/in/branch",
and that section will always be utf-8 + urlquoting.

And I think it is worthwhile to try to print out as much of the url
as we can.
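
The chunk-by-chunk display decoding described above can be sketched
roughly as follows. This is present-day Python 3 rather than the Python 2
that bzr used at the time, and it is only an illustration: the name
url_for_display is made up here and is not part of bzr. Each
'/'-separated chunk is unescaped and decoded as utf-8 on its own, and a
chunk that is not valid utf-8 is left in its escaped form.

```python
from urllib.parse import unquote_to_bytes


def url_for_display(url):
    """Return a human-readable form of an escaped URL (hypothetical helper).

    Every '/'-separated chunk is handled independently, so one chunk in
    another encoding does not stop the remaining chunks from being shown.
    """
    shown = []
    for chunk in url.split('/'):
        try:
            # Undo the %XX escapes, then require the bytes to be valid utf-8.
            shown.append(unquote_to_bytes(chunk).decode('utf-8'))
        except UnicodeDecodeError:
            # Not utf-8: keep the chunk exactly as it was, still escaped.
            shown.append(chunk)
    return '/'.join(shown)
```

With this, 'http://host/caf%E9/r%C3%A9sum%C3%A9' keeps its first path
chunk escaped (it is iso-8859-1) and shows the second as 'résumé'. The
result is for display only: it also unescapes ASCII escapes such as %20,
so it should not be sent back to the server.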

> 
>> Also, I think internally we should use the escaped form, and that is
>> what we save and send to the host. That way, if the host uses a
>> different escaping, the user can just manually escape it, rather than
>> typing it as unicode.
> 
> To be more precise - we should internally use the *proper* URL, not
> these heuristic kinda-urls.
> 
> Accepting input is easy, but displaying URLs as Unicode seems quite
> likely to break for people with non-UTF-8 servers.
> 

1) I believe that a non-utf-8 server is a rare case (though I don't have
much to back that up).

2) I'm not sure what you mean by *proper* URL. I think we might just be
talking past each other. My plan is that if a user types in something
that is not 7-bit ASCII, I will treat it as unicode, and utf-8+urlescape
it. If they type in 7-bit ASCII, then I will not change it, other than
stuff like ' ' => %20.
At that point I have what I have determined to be the 'correct' URL, and
I will use it. If it turns out that the server doesn't use utf-8, then
the user is required to type the path as an escaped 7-bit ASCII URL
(except for the portion controlled by .bzr, since the local bzr
interprets that portion of the path).

3) It is rare that iso-8859-1 text will decode as utf-8. Once you go
beyond 7-bit characters, utf-8 always requires at least 2 bytes per
character, and it has strict rules about which bytes are allowed next to
each other. (Every byte in a sequence has its high bit set; the lead
byte's top bits give the length of the sequence, and each following byte
must start with the bits 10.) So while any byte string can be decoded as
iso-8859-1, a lot of byte strings are not valid utf-8.

4) The benefits of seeing the real filename/branch name when it is utf-8
greatly outweigh the cost of the problems multiplied by the probability
that it isn't. I suppose we could add a configuration item if you are
still concerned.
"display_nice_urls = False" could disable trying to show pretty urls.

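Points 2 and 3 above can be sketched in present-day Python 3 as below.
This is not bzr code: the names normalize_user_url, is_valid_utf8 and
SAFE, and the exact set of characters treated as safe, are assumptions
made for illustration. '%' is kept in the safe set so that input which
is already escaped is not escaped a second time.

```python
from urllib.parse import quote

# Assumed set of ASCII characters to leave alone in user input.
# Letters, digits and "_.-~" are never escaped by quote() anyway.
SAFE = "%/:?#[]@!$&'()*+,;=~"


def normalize_user_url(text):
    """Point 2: utf-8 + percent-escape non-ASCII input, and escape
    unsafe ASCII such as ' ', leaving everything else unchanged."""
    return quote(text, safe=SAFE)


def is_valid_utf8(raw):
    """Point 3: report whether a byte string is well-formed utf-8."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# An iso-8859-1 'caf\xe9' is not valid utf-8, because 0xE9 announces a
# 3-byte sequence that never arrives.
latin1_name = 'caf\u00e9'.encode('latin-1')

# The rare coincidence: the two iso-8859-1 characters 0xC3 0xA9 also
# happen to be the utf-8 encoding of a single character.
coincidence = b'\xc3\xa9'
```

Here normalize_user_url('http://host/caf\u00e9 menu') gives
'http://host/caf%C3%A9%20menu', while an already escaped
'http://host/caf%E9/x' passes through unchanged. is_valid_utf8 rejects
latin1_name and accepts coincidence, and no single byte of 0x80 or above
is valid utf-8 on its own.
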
John
=:->

