Unicode URLs (was Re: "Using Saved Location: foo")

Wed May 3 02:36:05 BST 2006

Robert Collins wrote:
> On Tue, 2006-05-02 at 09:02 -0500, John Arbash Meinel wrote:
> 
> 
>> 1) I believe that it is a rare case. (Though I don't have much to back
>> that up)
> 

I may need an example, or maybe I just need to read what you wrote a few
times. But it isn't quite clicking for me.

> On a non-unicode machine, we render text by encoding unicode -> the code
> page.

Well, on a utf-8 machine, we render the text by encoding unicode ->
utf-8. I suppose if sys.stdout.encoding was utf-16 then we would use
that. But sure, we try to render the text using the codepage.

> 
> If the URL is encoded in the code page of the machine (which is quite
> likely), then what we are doing is reinterpreting a (for instance)
> utf16, or cp-1251 string as utf8, and then outputting that back into the
> local encoding again - but with a different meaning.

I don't understand the 'but with a different meaning'. And there is also
some confusion as to whether you mean the % escaped URL, or the
code-page encoded bytestream.

Arguments to bzr are assumed to be in bzrlib.user_encoding and are
decoded into Unicode by the command parser. If they were 7-bit ascii
URLs, this generally works anyway since most (all?) code pages keep
ascii in their lower bits.
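To make the "ascii in their lower bits" point concrete, here is a small sketch in present-day Python 3 (not bzrlib code). The list of encodings is just an illustrative choice. It also shows why the "(all?)" is warranted: utf-16 is not ASCII-compatible.

```python
# A 7-bit ASCII URL taken from the example later in this message.
url = "http://bazaar-vcs.org/bzr/b%E5gfors"

# Illustrative selection of single-byte and multi-byte encodings.
encodings = ["ascii", "latin-1", "cp1251", "utf-8"]
encoded = {name: url.encode(name) for name in encodings}

# Every one of these produces the same bytes for pure-ASCII input.
ascii_compatible = all(value == encoded["ascii"] for value in encoded.values())

# utf-16 is the counterexample: each character takes two bytes.
utf16_differs = url.encode("utf-16-le") != encoded["ascii"]

print(ascii_compatible, utf16_differs)
```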

So let's try an example. My codepage is latin-1. Am I inputting the
string: (I'm using python notation for all strings to remove ambiguity
as to what the exact bytes are)
'bzr get http://bazaar-vcs.org/bzr/b\xe5gfors'
Or am I typing
'bzr get http://bazaar-vcs.org/bzr/b%E5gfors'

The former would be transformed internally into:
['bzr', u'get', u'http://bazaar-vcs.org/bzr/b\xe5gfors']

Which is a Unicode "url", not an ascii one. At this point, we have to do
something with it. We could
A) Disallow non-ascii URLs
B) Assume our user input is reasonably correct, and create a new URL
that is ASCII by encoding into utf-8 and url quoting. Ending up with the
url: 'http://bazaar-vcs.org/bzr/b%C3%A5gfors'
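Option B can be sketched roughly like this, using Python 3's urllib.parse rather than anything in bzrlib. The helper name normalize_url is hypothetical, and only the path component is escaped here, which is an assumption of the sketch.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url(url, encoding="utf-8"):
    """Percent-escape non-ASCII path characters (hypothetical helper)."""
    parts = urlsplit(url)
    # safe="/%" keeps path separators and existing %XX escapes untouched,
    # so a URL that is already escaped passes through unchanged.
    path = quote(parts.path, safe="/%", encoding=encoding)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


normalized = normalize_url("http://bazaar-vcs.org/bzr/b\xe5gfors")
print(normalized)
```

Because "%" is treated as safe, feeding the already-escaped form back in returns it as-is, which matches the 7-bit ascii case described further down.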

Now, I think we agree on the user input side. That we take what we can
from the user, and we 'normalize' Unicode characters into utf-8 + %
escaping. Based on the assumption that web servers will be exporting
utf-8 paths.
Now, one could also give the alternate assumption that a user will
generally access a server which is using the same encoding that they
have. So we should normalize the url into:
'http://bazaar-vcs.org/bzr/b%E5gfors'
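The two assumptions differ only in which encoding is applied before % escaping. A short Python 3 comparison (a sketch, not bzr's implementation):

```python
from urllib.parse import quote

name = "b\xe5gfors"

# Assume the server exports utf-8 paths.
as_utf8 = quote(name, encoding="utf-8")

# Assume the server uses the same latin-1 encoding as the user.
as_latin1 = quote(name, encoding="latin-1")

print(as_utf8)
print(as_latin1)
```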

So that is the input side. Now suppose internally we somehow arrived at
having url='http://bazaar-vcs.org/bzr/b%C3%A5gfors'
The user could have typed that directly, and we would have gotten the
unicode string
url=u'http://bazaar-vcs.org/bzr/b%C3%A5gfors'
which we then find is 7-bit ascii, and just return
url='http://bazaar-vcs.org/bzr/b%C3%A5gfors'

So either user input ends up with that string. At this point, we have
saved that exact string into .bzr/branch/parent
And now I go to do "bzr pull", and bzr decides to print out what branch
it is connecting to.

I'm proposing that when we print it, we try to unescape and decode the
URL (by chunk) and if that succeeds we create the unicode string:
url = u'http://bazaar-vcs.org/bzr/b\xe5gfors'
Which will then be printed in latin-1 as:
'http://bazaar-vcs.org/bzr/b\xe5gfors'
(I wish at this point I knew of an encoding that supported a character
but doesn't translate into the exact same byte value. µ => \xb5 in
cp1251 and latin-1 and unicode (\xc2\xb5 in utf-8).)

Because sys.stdout.encoding is latin-1, the above string should show
properly in the user's terminal as:
http://bazaar-vcs.org/bzr/bågfors
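A minimal sketch of the display rule, in Python 3. The function name url_for_display is hypothetical, and splitting on "/" is my reading of "by chunk". One known limitation of this sketch: an escaped "%2F" would be turned into a literal slash.

```python
from urllib.parse import unquote_to_bytes


def url_for_display(url):
    """Return a readable form of url, or url unchanged if it is not utf-8."""
    chunks = []
    for chunk in url.split("/"):
        try:
            # Undo the % escaping, then require the bytes to be valid utf-8.
            chunks.append(unquote_to_bytes(chunk).decode("utf-8"))
        except UnicodeDecodeError:
            # Any chunk that is not utf-8 means we leave the URL alone.
            return url
    return "/".join(chunks)


shown = url_for_display("http://bazaar-vcs.org/bzr/b%C3%A5gfors")
# What a latin-1 terminal would actually receive.
terminal_bytes = shown.encode("latin-1")
print(shown)
```

A URL escaped from latin-1 bytes, such as 'http://bazaar-vcs.org/bzr/b%E5gfors', fails the utf-8 decode and comes back unchanged.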

> 
> If copy n paste of this works properly, you've now lost the correct
> characters to pass the common 'treat unicode as being utf8 encoded in
> the url' input heuristic a lot of programs have.

If I understand you correctly, you are saying that Firefox will try to
connect to http://bazaar-vcs.org/bzr/b%C3%A5gfors because of the unicode
character, even though the user's encoding is latin-1. Honestly, I don't
know which way Firefox would go.

Now, if earlier on, the user typed:
'bzr get http://bazaar-vcs.org/bzr/b%E5gfors'
that would get decoded into 'unicode', and then back as a URL without
being changed. So when we read it in from .bzr/branch/parent and try to
decode it for display, the decoding will fail, and we will not change
the string. Thus printing:
'http://bazaar-vcs.org/bzr/b%E5gfors'

> 
> If it does not work correctly, then you still lose, as the byte sequence
> is no longer the right one that the server wants.
> 

If the user types "%E5" you haven't lost anything, because that will be
preserved. If the user types "\xe5" then you have an ambiguity anyway,
and we should resolve it as best we can.

> A more common case is where the local encoding is utf8 safe, but the URL
> is again not utf8 encoded. In this case we still transcode as far as
> other inputs are concerned, but we dont break if the byte sequence is
> preserve instead of the glyphs.
> 

I'm not sure what local encoding is 'utf-8' safe without being utf-8.

And I don't really know what the sentence "byte sequence is preserve
instead of the glyphs" means.

I think you meant 'preserved'. And you are in some ways correct. If the
user typed "%C3%A5" and their encoding is latin-1, the bytes that come
out would be "\xe5".

I could be persuaded that we should only try to decode utf-8 if
user_encoding == 'utf-8'. But I'm not really willing to give up printing
nice-looking strings, so that the user can tell they are grabbing the branch
http://bzr.arbash-meinel.com/branches/foo/جوجو
instead of
http://bzr.arbash-meinel.com/branches/foo/لولو

Rather than seeing the branches:
http://bzr.arbash-meinel.com/branches/foo/%D8%AC%D9%88%D8%AC%D9%88
and
http://bzr.arbash-meinel.com/branches/foo/%D9%84%D9%88%D9%84%D9%88

The former is quite obvious even for someone who doesn't know arabic.
The latter is very difficult even for someone who knows hex codes.

With the Unicode form, they may not know what the characters mean, but it
is obvious at a glance that they are different branches. With the escaped
form, I have to focus really hard and compare character by character to
know that they are different. *And* the only way I could know which was
which would be to open python, and tell it to decode it for me.
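For reference, "open python and decode it" looks something like this in Python 3 (a sketch using the two escaped URLs above):

```python
from urllib.parse import unquote

escaped = [
    "http://bzr.arbash-meinel.com/branches/foo/%D8%AC%D9%88%D8%AC%D9%88",
    "http://bzr.arbash-meinel.com/branches/foo/%D9%84%D9%88%D9%84%D9%88",
]

# errors="strict" makes invalid utf-8 raise instead of silently
# substituting replacement characters.
decoded = [unquote(url, encoding="utf-8", errors="strict") for url in escaped]

for url in decoded:
    print(url)
```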

So, I'm willing to be very strict about the conditions under which we
allow URLs to be unescaped for display. For example, the user_encoding
must be Unicode safe, i.e. utf8, utf16, ucs2, etc.
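One way such a strict check could look, as a Python 3 sketch. The names UNICODE_SAFE, is_unicode_safe and url_for_display are hypothetical, and the set of encodings treated as safe is an assumption. Names that Python's codec registry does not recognize are treated as unsafe.

```python
import codecs
from urllib.parse import unquote_to_bytes

# Assumption: canonical codec names we consider able to represent any text.
UNICODE_SAFE = {"utf-8", "utf-16", "utf-32"}


def is_unicode_safe(encoding):
    """True if encoding normalizes to one of the UNICODE_SAFE codecs."""
    if not encoding:
        return False
    try:
        # codecs.lookup() maps aliases such as 'utf8' to a canonical name.
        return codecs.lookup(encoding).name in UNICODE_SAFE
    except LookupError:
        return False


def url_for_display(url, user_encoding):
    """Unescape url for display only when user_encoding is Unicode safe."""
    if not is_unicode_safe(user_encoding):
        return url
    try:
        return unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return url


print(url_for_display("http://bazaar-vcs.org/bzr/b%C3%A5gfors", "utf8"))
print(url_for_display("http://bazaar-vcs.org/bzr/b%C3%A5gfors", "latin-1"))
```

With a latin-1 user_encoding the escaped URL is shown untouched, so what is displayed is always something that can be pasted back in.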

>> The benefits of seeing the real filename/branch name when it is utf-8
>> greatly outweigh the problems * probability that it isn't. I suppose we
>> could add a Configure item if you are still concerned.
>> "display_nice_urls = False" could disable trying to show pretty urls.
> 
> So you are proposing that given a URL in its normal form - % escaped
> sequence of bytes, you will unescape it once, and then try to interpret
> the resulting byte sequence as a utf8 string. If that successfully
> converts, then display it to the user as such?
> 
> I think this is highly risky because when it goes wrong, users will have
> no discoverable fallback to get the real url. 
> 
> I suggest showing relative references for local paths via the utf8
> decode approach, which will mean that local disk paths will look
> correct, but references to external resources will be safely
> transportable.
> 
> Rob
> 

I feel like I've given a strong example where you would want remote URLs
to still display Unicode. If we disagree, then I can code it up to be
configurable, and we can try and reach a bzr community consensus about
what the default should be.

John
=:->
