Unicode URLs (was Re: "Using Saved Location: foo")
John Arbash Meinel
john at arbash-meinel.com
Wed May 3 15:10:27 BST 2006
Martin Pool wrote:
> On 03/05/2006, at 12:02 AM, John Arbash Meinel wrote:
...
>> The specific instance I am thinking about is any server that uses
>> non-utf-8, but everything next to '.bzr' is going to be utf-8.
>>
>> So far, all of the bzr control files are ascii. And I think we're
>> planning on keeping that (considering the lengths we're going with url
>> escaping).
>
> Yes, we certainly should, so that at least they are safe from these
> encoding problems.
I agree. It also handles the case where the local filesystem does
something weird with Unicode names. (I'm looking at you, HFS+.)
>
>> However, bzr supports stuff like:
>>
>> bzr log http://host/path/to/branch/file/in/branch
>>
>> So while the server would decode:
>> http://host/path/to/branch/
>>
>> bzr itself is decoding "file/in/branch"
>> And that section will always be utf-8 + urlquoting.
>
> There's certainly a difference between the url of the start of the
> branch and the relpath inside it, but I disagree that the relpath is in
> utf-8. At the moment we keep the relpath in memory as a Unicode string
> (in no particular encoding) after finding the root, and I think that's
> still the best thing to do. Intentionally building URLs with different
> parts in different encodings sounds rather problematic.
>
> If we wanted to access the working copy across http we'd need to put the
> filename in the appropriate encoding for the server, which will be the
> same as for the rest of the URL.
I'm not advocating that we maintain a URL internally for the working
directory. I think we settled that it should be a Unicode string.
The question, though, is this: when I type
bzr log http://host/branch/file
how would I type "file" if it were a Unicode name?
1) I would like to be able to type just the plain unicode, but until bzr
discovers where the 'branch' is, it needs to have some sort of URL.
2) I think utf-8 makes the most sense for encoding 'file' until we
figure out that it is *inside* the branch.
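
To make point 2 concrete, here is a minimal sketch of "utf-8 +
urlquoting" for the relative path. It uses Python 3's standard
urllib.parse; the helper names escape_relpath and unescape_relpath are
made up for illustration and are not existing bzr functions.

```python
from urllib.parse import quote, unquote


def escape_relpath(relpath: str) -> str:
    """Encode a Unicode path as UTF-8, then percent-escape it to 7-bit ASCII.

    Hypothetical helper: '/' is kept literal so path sections survive.
    """
    return quote(relpath.encode("utf-8"), safe="/~")


def unescape_relpath(escaped: str) -> str:
    """Reverse of escape_relpath: undo the escaping and decode as UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")


# Build a URL whose trailing section names a file with a non-ASCII name.
url = "http://host/branch/" + escape_relpath("résumé.txt")
print(url)  # http://host/branch/r%C3%A9sum%C3%A9.txt
```

The escaped form is plain ASCII, so it can be carried as a URL until
the branch root is found, then turned back into a Unicode string.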
...
> I think the combination of #1, 3, and 4 mean it's unlikely it will
> actually fail in a bad way, but it is possible. Robert thinks the risk
> means we shouldn't do it; personally I feel as long as there is an
> option to turn it off it should be OK.
>
> --Martin
One other possibility presents itself. If we change the command parser
to pass the byte string that was given, we could save that instead. It
won't be 7-bit ascii, but it is just an 8-bit blob.
Then if we chop a couple of path sections off the end, we should still
end up with what the user said was the official path to the other branch.
And then we just display back to them whatever they typed in.
We probably would have to apply some heuristics internally to change
their 8-bit blob into a 7-bit URL, but we wouldn't have to write that
into .bzr/branch/parent.
However, this would mean that .bzr/branch/parent is no longer a utf-8
file. And it would mean that if someone whose local encoding is
latin-1 publishes a branch, someone else whose encoding is utf-8
will misinterpret the .bzr/branch/parent file.
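
The misinterpretation can be shown with a small sketch (Python 3). The
function read_parent and the sample URL are hypothetical; they only
stand in for "read the raw bytes of the parent file using the reader's
own encoding".

```python
def read_parent(raw: bytes, reader_encoding: str) -> str:
    """Decode the raw bytes of a parent file the way a given reader would.

    Undecodable bytes become U+FFFD instead of raising, to show the damage.
    """
    return raw.decode(reader_encoding, errors="replace")


location = "http://host/café/branch"

# Publisher stores the raw 8-bit blob in latin-1; reader assumes utf-8.
stored_latin1 = location.encode("latin-1")
print(read_parent(stored_latin1, "utf-8"))    # the 'é' is lost

# The opposite mismatch silently produces the wrong characters.
stored_utf8 = location.encode("utf-8")
print(read_parent(stored_utf8, "latin-1"))    # 'é' shows up as two characters
```

In neither direction does the reader get back the path the publisher
typed, which is the cost of storing the unconverted blob.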
I'm not really advocating it, but if we are worried about the round-trip
effects of user input => unicode => url => unicode => output, we could
skip a lot of those steps.
There is another issue that I thought of after Robert mentioned it: what
about a URL that cannot be displayed in the current codepage (e.g. cp1251)?
At that point we should leave it URL-escaped, because that is
better than getting a decode error or a "?" in the path.
So the function becomes "unescape_for_display(url, encoding)", where we
add one more step of trying to convert the string into the user's
encoding, and if it fails, we leave the hunk unchanged.
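
A minimal sketch of that function, in Python 3, might look like the
following. Only the name and arguments come from the description above;
the body is an assumption: it treats each '/'-separated section as a
"hunk" and assumes the escaped bytes are utf-8.

```python
from urllib.parse import unquote_to_bytes


def unescape_for_display(url: str, encoding: str) -> str:
    """Unescape each path section that the user's encoding can display.

    Sections that are not valid utf-8, or that cannot be represented in
    `encoding`, are left in their escaped form.
    """
    shown = []
    for hunk in url.split("/"):
        try:
            text = unquote_to_bytes(hunk).decode("utf-8")
            text.encode(encoding)  # the extra step: can the terminal show it?
        except UnicodeError:
            text = hunk  # leave the hunk unchanged (still escaped)
        shown.append(text)
    # Caveat: a real version would keep escapes such as %2F or %25 escaped.
    return "/".join(shown)


url = "http://host/%D0%BF%D1%83%D1%82%D1%8C/r%C3%A9sum%C3%A9"
print(unescape_for_display(url, "cp1251"))
print(unescape_for_display(url, "ascii"))
```

With cp1251 the Cyrillic section is shown as text while the section
containing "é" stays escaped; with ascii the whole URL is left as is.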
John
=:->