Unicode URLs (was Re: "Using Saved Location: foo")

Robert Collins robertc at robertcollins.net
Wed May 3 00:54:48 BST 2006


On Tue, 2006-05-02 at 09:02 -0500, John Arbash Meinel wrote:


> 1) I believe that it is a rare case. (Though I don't have much to back
> that up)

On a non-unicode machine, we render text by encoding unicode -> the code
page.

If the URL is encoded in the code page of the machine (which is quite
likely), then what we are doing is reinterpreting a (for instance)
utf16 or cp1251 byte string as utf8, and then outputting that back
into the local encoding again - but with a different meaning.
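
To make that concrete (an illustrative Python 3 sketch, not anything
bzr actually does):

    # These two bytes mean 'Р°' in cp1251, but they also happen to be
    # valid UTF-8, where they decode to the single character 'а'.  A
    # 'try utf-8 first' heuristic renders the wrong glyphs, with no error.
    raw = b'\xd0\xb0'
    print(raw.decode('cp1251'))   # 'Р°'  - what the original author meant
    print(raw.decode('utf-8'))    # 'а'   - what the heuristic displays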

If copy and paste of this works properly, you have now lost the correct
characters to feed to the common 'treat unicode in a URL as being utf8
encoded' input heuristic that a lot of programs have.

If it does not work correctly, then you still lose, as the byte sequence
is no longer the right one that the server wants.

A more common case is where the local encoding is utf8 safe, but the URL
is again not utf8 encoded. In this case we still transcode as far as
other inputs are concerned, but we don't break if the byte sequence is
preserved instead of the glyphs.
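
Concretely (illustrative Python 3, not bzrlib code): keeping the
%-escaped bytes round-trips exactly, while keeping the glyphs under a
wrong encoding assumption silently changes the byte sequence.

    from urllib.parse import quote_from_bytes, unquote_to_bytes

    escaped = 'caf%E9'                # latin-1 'café', not valid UTF-8
    raw = unquote_to_bytes(escaped)   # b'caf\xe9' - what the server wants

    # Preserving the byte sequence is lossless:
    assert quote_from_bytes(raw) == 'caf%E9'

    # Preserving the glyphs via a wrong encoding assumption is not:
    pretty = raw.decode('iso-8859-1')                # displayed as 'café'
    assert pretty.encode('utf-8') == b'caf\xc3\xa9'  # different bytes now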

> 2) I'm not sure what you mean by *proper* URL. I think we might just be
> talking past each other. My plan is that if a user types in something
> that is not 7-bit ASCII, I will treat it as unicode, and utf-8+urlescape
> it. If they type in 7-bit ASCII, then I will not change it, other than
> stuff like ' ' => %20.
> At that point I have what I have determined to be the 'correct' URL. And
> I will use it. If it turns out that the server doesn't use utf-8, then
> the user is required to type the path as 7-bit ASCII URL. (Except for
> the portion controlled by .bzr, since the local bzr interprets that
> portion of the path).

Well, a well-formed URL is by definition one that matches the STD 66
ABNF; I'm guessing Martin means the canonical representation of the URL.
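
The normalisation John describes in point 2 would, as I read it, look
roughly like this (an illustrative Python 3 sketch with a hypothetical
function name, not bzrlib code):

    from urllib.parse import quote

    def normalise_path_segment(text):
        # Non-ASCII user input is assumed to be unicode: its UTF-8 bytes
        # are %-escaped (quote() does exactly that for str input).
        # Plain ASCII is left alone apart from unsafe characters,
        # e.g. ' ' -> '%20'.
        return quote(text, safe='/')

    normalise_path_segment('my branch/åäö')
    # -> 'my%20branch/%C3%A5%C3%A4%C3%B6'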

> 3) It is rare that iso8859-1 will decode to utf-8. Because once you try
> for a non 7-bit character, I believe utf-8 always requires at least 2
> bytes. And it has special rules about what sets of bytes are allowed
> next to each other. (Both are 8-bit, with specific bits based on number
> of characters, etc). So while anything can be decoded by iso-8859-1, a
> lot of byte strings are not valid utf-8.

I'm much less concerned by the input heuristic than the output
heuristic.
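
For completeness, the input heuristic from point 3 is easy to sketch
(illustrative Python 3, not bzrlib code): a strict utf-8 decode rejects
most non-utf8 byte strings, so trying utf-8 first and falling back to
iso-8859-1 rarely misfires on input.

    def guess_decode(url_bytes):
        # Try the strict decode first; most code-page byte strings fail it.
        try:
            return url_bytes.decode('utf-8')
        except UnicodeDecodeError:
            # iso-8859-1 accepts any byte sequence, so this never raises.
            return url_bytes.decode('iso-8859-1')

    guess_decode(b'caf\xc3\xa9')   # valid UTF-8           -> 'café'
    guess_decode(b'caf\xe9')       # not UTF-8, falls back -> 'café'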

> 4) The benefits of seeing the real filename/branch name when it is utf-8
> greatly outweigh the problems * probability that it isn't. I suppose we
> could add a Configure item if you are still concerned.
> "display_nice_urls = False" could disable trying to show pretty urls.

So you are proposing that, given a URL in its normal form - a %-escaped
sequence of bytes - you will unescape it once, and then try to interpret
the resulting byte sequence as a utf8 string, and if that successfully
converts, display it to the user as such?

I think this is highly risky because when it goes wrong, users will have
no discoverable fallback to get the real url. 
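
As I read it, the proposal is roughly this (illustrative Python 3
sketch, hypothetical function name, not bzrlib code):

    from urllib.parse import unquote_to_bytes

    def url_for_display(url):
        raw = unquote_to_bytes(url)
        try:
            # Show the unescaped form only if the bytes are valid UTF-8.
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return url   # otherwise fall back to the escaped original

    url_for_display('http://host/caf%C3%A9')   # 'http://host/café'
    url_for_display('http://host/caf%E9')      # left as 'http://host/caf%E9'

When the bytes are valid utf8 but were never meant as such (the cp1251
example above), the user sees the wrong name and the escaped original is
nowhere left to copy.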

I suggest showing relative references for local paths via the utf8
decode approach, which will mean that local disk paths will look
correct, but references to external resources will be safely
transportable.
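
Something along these lines, roughly (again an illustrative Python 3
sketch with a hypothetical helper name, not bzrlib code):

    from urllib.parse import unquote_to_bytes, urlparse

    def location_for_display(location):
        # Local paths: show them via the utf8 decode approach, so local
        # disk paths look right.
        if urlparse(location).scheme in ('', 'file'):
            return unquote_to_bytes(location).decode('utf-8', 'replace')
        # References to external resources: leave exactly as given, so
        # they stay transportable byte for byte.
        return location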

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.