Unicode URLs (was Re: "Using Saved Location: foo")
mbp at sourcefrog.net
Tue May 2 14:49:42 BST 2006
On 2 May 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Martin Pool wrote:
> > On 02/05/2006, at 3:18 PM, Jan Hudec wrote:
> >> Well, with a twist. The input url is actually 'encoded', so urlescape
> >> must not escape % in this case.
> > Right - actually I think doing it chunk-by-chunk is just going to
> > complicate things. I think the other algorithm I quoted would be better
> > for handling these pseudo-URLs, since it doesn't touch things that might
> > already be escaped:
> >     for c in pseudo_url:
> >         if c in url_safe_characters:
> >             r += c
> >         else:
> >             if isinstance(c, unicode):
> >                 r += urlescape(c.encode('utf-8'))
> > --Martin
> Well, I think we need to decode for display by chunk, because if a chunk
> doesn't decode by 'utf-8' then I think we should leave it escaped.
The question is really: if one path component can't be decoded as UTF-8,
should you try to decode the others that way? There are cases where
that might be right (1), but others where it's wrong (2):
  (1) a server where user home directories are named in 8859-1, but
      one user chooses to name their files in utf-8
  (2) a server where all paths are in some codepage, but it just
      coincidentally happens that one path component can be decoded
      as garbage but valid UTF-8; it would be better not to decode
      it.
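To make the per-component option concrete, here is a minimal sketch in present-day Python 3 (an assumption; the code in this thread is Python 2). The names display_component and display_path are hypothetical, not anything in bzr. Each component is unescaped and decoded on its own, and a component that is not valid UTF-8 is left in its escaped form:

```python
from urllib.parse import unquote_to_bytes


def display_component(component):
    """Return a readable form of one escaped path component.

    Hypothetical helper, for display only: the result is not meant to be
    turned back into a URL.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so show the component exactly as it was escaped.
        return component


def display_path(path):
    """Decode each '/'-separated component independently (hypothetical helper)."""
    return '/'.join(display_component(c) for c in path.split('/'))


# Case (1): an 8859-1 directory name stays escaped, the UTF-8 file name is decoded.
mixed = display_path('/home/caf%E9/r%C3%A9sum%C3%A9')

# Case (2): the 8859-1 bytes C3 A9 are also valid UTF-8, so the heuristic
# decodes them even though the server meant something else.
coincidence = display_component('%C3%A9')
intended = b'\xc3\xa9'.decode('latin-1')
```

The last three lines show why this is only a heuristic: nothing in the bytes says which encoding the server intended, so case (2) cannot be detected from the URL alone.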
> Also, I think internally we should use the escaped form, and that is
> what we save and send to the host. That way, if the host uses a
> different escaping, the user can just manually escape it, rather than
> typing it as unicode.
To be more precise - we should internally use the *proper* URL, not
these heuristic kinda-urls.
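As a rough illustration of converting typed input to a proper URL once, at the boundary, here is the character-by-character algorithm quoted above sketched in Python 3 (again an assumption; the quoted version is Python 2). The name escape_pseudo_url and the exact contents of URL_SAFE are hypothetical choices for this sketch. '%' is treated as safe so that input which is already escaped passes through untouched:

```python
from urllib.parse import quote

# Assumed safe set: the RFC 3986 unreserved and reserved characters, plus '%'
# so that existing escapes such as %E9 are not escaped a second time.
URL_SAFE = frozenset(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)


def escape_pseudo_url(pseudo_url):
    """Turn a user-typed pseudo-URL into an ASCII-only escaped URL.

    Hypothetical helper: characters outside URL_SAFE are encoded as UTF-8
    and percent-escaped; everything else is copied unchanged.
    """
    parts = []
    for c in pseudo_url:
        if c in URL_SAFE:
            parts.append(c)
        else:
            parts.append(quote(c.encode('utf-8'), safe=''))
    return ''.join(parts)


typed = 'http://host/caf%E9/r\u00e9sum\u00e9 x'
proper = escape_pseudo_url(typed)
```

Because the output contains only safe characters, running the function on its own result changes nothing, which is what makes the escaped form reasonable to store and send. The cost is that a literal '%' typed by the user is taken to be the start of an escape; the sketch makes no attempt to tell the two apart.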
Accepting input is easy, but displaying URLs as Unicode seems quite
likely to break for people with non-UTF-8 servers.