Unicode URLs (was Re: "Using Saved Location: foo")

Wed May 3 07:27:27 BST 2006

On 03/05/2006, at 12:02 AM, John Arbash Meinel wrote:
>>
>> The question is really: if one path component can't be decoded as
>> UTF-8, should you try to decode the others that way?  There are cases
>> where that might be right(1), but others where it's wrong(2):
>>
>>  (1) a server where user home directories are named in 8859-1, but
>>      one user chooses to name their files in utf-8
>>
>>  (2) a server where all paths are some codepage, but it just
>>      coincidentally happens that one path component can be decoded
>>      as garbage but valid UTF-8; it would be better not to decode
>>      at all
>
> The specific instance I am thinking about is any server that uses
> non-utf-8, but everything next to '.bzr' is going to be utf-8.
>
> So far, all of the bzr control files are ascii. And I think we're
> planning on keeping that (considering the lengths we're going with url
> escaping).

Yes, we certainly should, so that at least they are safe from these  
encoding problems.

> However, bzr supports stuff like:
>
> bzr log http://host/path/to/branch/file/in/branch
>
> So while the server would decode:
> http://host/path/to/branch/
>
> bzr itself is decoding "file/in/branch"
> And that section will always be utf-8 + urlquoting.

There's certainly a difference between the url of the start of the  
branch and the relpath inside it, but I disagree that the relpath is  
in utf-8.  At the moment we keep the relpath in memory as a Unicode  
string (in no particular encoding) after finding the root, and I  
think that's still the best thing to do.  Intentionally building URLs  
with different parts in different encodings sounds rather problematic.

If we wanted to access the working copy across http we'd need to put  
the filename in the appropriate encoding for the server, which will  
be the same as for the rest of the URL.

>>> Also, I think internally we should use the escaped form, and that is
>>> what we save and send to the host. That way, if the host uses a
>>> different escaping, the user can just manually escape it, rather than
>>> typing it as unicode.
>>
>> To be more precise - we should internally use the *proper* URL, not
>> these heuristic kinda-urls.
>>
>> Accepting input is easy, but displaying URLs as Unicode seems quite
>> likely to break for people with non-UTF-8 servers.
>>
>
> 1) I believe that it is a rare case. (Though I don't have much to back
> that up)

If it really is rare it will make things much easier.

> 2) I'm not sure what you mean by *proper* URL. I think we might just be
> talking past each other.

Sorry, that was a bit vague.  What I meant was a URL matching the
standard grammar, i.e. with all characters properly escaped.

> My plan is that if a user types in something that is not 7-bit ASCII,
> I will treat it as unicode, and utf-8 + urlescape it. If they type in
> 7-bit ASCII, then I will not change it, other than stuff like
> ' ' => %20.

Yes.
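
As a rough illustration of that plan (only a sketch, written for
present-day Python 3 and not taken from bzr; normalize_user_url and
_SAFE are made-up names, and the exact set of characters left alone is
an assumption):

```python
from urllib.parse import quote

# Hypothetical choice of characters to leave untouched: common URL
# delimiters, plus '%' so that escapes the user typed by hand are not
# escaped a second time.
_SAFE = "%/:@?&=+$,;#"


def normalize_user_url(text):
    """Turn user-typed text into a fully escaped, 7-bit ASCII URL.

    Non-ASCII characters are encoded as UTF-8 and then percent-escaped.
    ASCII input is left alone apart from unsafe characters such as ' '.
    """
    return quote(text, safe=_SAFE, encoding="utf-8")


# Plain ASCII: only the space changes.
#   normalize_user_url("http://host/my branch")
# Non-ASCII: encoded as UTF-8, then escaped.
#   normalize_user_url("http://host/caf\u00e9")
# Already escaped by hand for a non-UTF-8 server: passed through as is.
#   normalize_user_url("http://host/caf%E9")
```

Keeping '%' in the safe set is what lets a user with a non-UTF-8
server type the escaped form by hand and have it used unchanged.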

> At that point I have what I have determined to be the 'correct' URL. And
> I will use it. If it turns out that the server doesn't use utf-8, then
> the user is required to type the path as 7-bit ASCII URL. (Except for
> the portion controlled by .bzr, since the local bzr interprets that
> portion of the path).

Yes, I agree.

> 3) It is rare that iso8859-1 will decode to utf-8. Because once you try
> for a non 7-bit character, I believe utf-8 always requires at least 2
> bytes. And it has special rules about what sets of bytes are allowed
> next to each other. (Both are 8-bit, with specific bits based on number
> of characters, etc). So while anything can be decoded by iso-8859-1, a
> lot of byte strings are not valid utf-8.
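
A small sketch of that point, assuming Python 3 (is_valid_utf8 is a
made-up helper name): typical iso-8859-1 text fails to decode as
UTF-8, but a coincidental match of the kind described in case (2)
above is still possible.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with an e-acute, encoded as iso-8859-1: the last byte is 0xE9.
# In UTF-8, 0xE9 starts a 3-byte sequence, so a lone 0xE9 is invalid.
latin1_name = "caf\u00e9".encode("iso-8859-1")      # b'caf\xe9'
typical_case = is_valid_utf8(latin1_name)           # False

# The unlucky coincidence: the two iso-8859-1 characters U+00C3 U+00A9
# encode to bytes 0xC3 0xA9, which is also valid UTF-8 for one e-acute.
unlucky_name = "\u00c3\u00a9".encode("iso-8859-1")  # b'\xc3\xa9'
coincidence = is_valid_utf8(unlucky_name)           # True
```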

> 4) The benefits of seeing the real filename/branch name when it is utf-8
> greatly outweigh the problems * probability that it isn't. I suppose we
> could add a Configure item if you are still concerned.
> "display_nice_urls = False" could disable trying to show pretty urls.

I think the combination of #1, 3, and 4 means it's unlikely it will
actually fail in a bad way, but it is possible.  Robert thinks the
risk means we shouldn't do it; personally I feel that as long as there
is an option to turn it off it should be OK.
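
Roughly what I have in mind for display, as a sketch only (Python 3;
display_url is a made-up name, and passing display_nice_urls as a
function argument rather than reading it from configuration is an
assumption for the example):

```python
from urllib.parse import unquote_to_bytes


def display_url(url, display_nice_urls=True):
    """Return a human-friendly form of an escaped URL, for display only.

    The escaped URL remains what is stored and sent to the server.
    If the unescaped bytes are not valid UTF-8, or the option is off,
    the escaped form is shown unchanged.
    """
    if not display_nice_urls:
        return url
    try:
        return unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        # Probably a non-UTF-8 server: show the escaped form as is.
        return url
```

This still shows garbage in the coincidental case where non-UTF-8
bytes happen to be valid UTF-8, which is exactly why the option to
turn it off matters.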

-- 
Martin