Unicode URLs (was Re: "Using Saved Location: foo")

Tue May 2 14:49:42 BST 2006

On  2 May 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Martin Pool wrote:
> > On 02/05/2006, at 3:18 PM, Jan Hudec wrote:
> >>
> > 
> >> Well, with a twist. The input url is actually 'encoded', so urlescape
> >> must not escape % in this case.
> > 
> > Right - actually I think doing it chunk-by-chunk is just going to
> > complicate things.  I think the other algorithm I quoted would be better
> > for handling these pseudo-URLs, since it doesn't touch things that might
> > already be escaped:
> > 
> >   r = ''
> >   for c in pseudo_url:
> >     if c in url_safe_characters:
> >       r += c
> >     else:
> >       # escape the UTF-8 bytes of anything that is not URL-safe
> >       if isinstance(c, unicode):
> >         c = c.encode('utf-8')
> >       r += urlescape(c)
> > 
> > --Martin
> 
> Well, I think we need to decode for display chunk by chunk, because if a
> chunk doesn't decode as UTF-8 then I think we should leave it escaped.

The question is really: if one path component can't be decoded as UTF-8,
should you try to decode the others that way?  There are cases where
that might be right(1), but others where it's wrong(2):

 (1) a server where user home directories are named in 8859-1, but
     one user chooses to name their files in utf-8

 (2) a server where all paths are in some codepage, but one path
     component just happens to be valid UTF-8 that decodes to
     garbage; it would be better not to decode at all
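
To make the two policies concrete, here is a minimal sketch in Python 3. The function names are hypothetical, not anything in bzr. It decodes only for display, so a decoded component may itself contain '/' or '%'.

```python
from urllib.parse import unquote_to_bytes


def decode_component(component):
    """Return one escaped path component decoded as UTF-8, or None."""
    try:
        return unquote_to_bytes(component).decode("utf-8")
    except UnicodeDecodeError:
        return None


def display_per_component(path):
    """Suits case (1): decode each component alone, leave failures escaped."""
    parts = []
    for component in path.split("/"):
        decoded = decode_component(component)
        parts.append(component if decoded is None else decoded)
    return "/".join(parts)


def display_all_or_nothing(path):
    """Suits case (2): decode only if every component is valid UTF-8."""
    decoded = [decode_component(c) for c in path.split("/")]
    if any(d is None for d in decoded):
        return path
    return "/".join(decoded)
```

For "/home/andr%E9/caf%C3%A9" the first function decodes only the last component and the second leaves the whole path escaped. Neither can notice a codepage name that happens to be valid UTF-8 when no other component fails to decode.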

> Also, I think internally we should use the escaped form, and that is
> what we save and send to the host. That way, if the host uses a
> different escaping, the user can just manually escape it, rather than
> typing it as unicode.

To be more precise - we should internally use the *proper* URL, not
these heuristic kinda-urls.
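
As a rough illustration of turning typed input into a proper URL, here is a Python 3 sketch. The name to_proper_url and the exact safe set are my assumptions. '%' is treated as safe so that escapes the user already typed are not touched.

```python
from urllib.parse import quote

# Assumption: characters passed through unchanged. '%' is included so that
# existing %XX escapes in the input survive as-is.
URL_SAFE = "%/:?#[]@!$&'()*+,;=-._~"


def to_proper_url(pseudo_url):
    """Escape unsafe characters as UTF-8 bytes; leave existing escapes alone."""
    return quote(pseudo_url, safe=URL_SAFE)
```

With this, "http://host/caf\u00e9/a%20b" becomes "http://host/caf%C3%A9/a%20b". The cost is that a literal '%' that was not meant as an escape also passes through unchanged, so the user has to type it as %25.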

Accepting input is easy, but displaying URLs as Unicode seems quite
likely to break for people with non-UTF-8 servers.

-- 
Martin