"Using Saved Location: foo"
John Arbash Meinel
john at arbash-meinel.com
Wed May 3 15:17:26 BST 2006
Jan Hudec wrote:
...
>> I just wanted to mention that the isinstance() check will always return
>> the same thing for every character. So really you only need to do:
>>
>> if not isinstance(psuedo_url, unicode):
>>     return psuedo_url
>>
>> r = []
>> for c in psuedo_url:
>>     if c in url_safe_characters:
>>         r.append(c)
>>     else:
>>         r.append(urlescape(c)) # urlescape does utf-8 encoding :)
>> return ''.join(r)
>>
>> You have to be a little bit careful, since usually ":" is not url safe,
>> but it will occur in all fully qualified urls.
>> (Urllib quotes everything that isn't A-Za-z0-9_.-)
>
> Bzr must *NOT* quote any of the special characters -- %/:?&=;,#$
> Otherwise there would be no way to enter their unescaped form.
> (note: according to the RFC, $ is special, though I don't really have
> any idea when it's used)
>
>> Also, the 'isinstance(...,unicode)' isn't really useful, because all
>> user input comes in as unicode. Which is why I was proposing to do:
>>
>> try:
>>     return psuedo_url.encode('ascii')
>> except UnicodeEncodeError:
>>     pass
>>
>> That should work for a large portion of URLs, and then we don't have to
>> do all of the per-character checking.
>
> It's .encode('ascii') that won't do. This needs quoting as well:
>
> http://www.ucw.cz/~bulb/{archives}/
>
> Neither '~', '{' nor '}' are allowed literally.
Well, I just did a test on my webserver, and it does allow {}. (I tested
both with urllib, and manually with telnet + GET.)

However, that could just be that Apache is generous with what it
accepts. Heck, it even accepted a UTF-8 string in the GET request:

  GET /pub/جوجو.html HTTP/1.0
  Host: john.arbash-meinel.com

Worked just fine. And in fact:

  urllib.urlopen(u'http://john.arbash-meinel.com/pub/جوجو.html'
                 .encode('utf8')).read()

worked as well.

I'm not saying it is following any sort of standard. And it is better to
follow the standard than to exploit whatever freedom a particular server
offers. But I found it interesting.
>
> The point of isinstance(...,unicode) is, that user input is always
> unicode while the encoded form is not. User input always needs exactly
> one round of encoding.
>
You make a good point. There are characters that are valid ASCII that
still aren't valid in URLs. I was concerned about double escaping the %
characters, but we certainly could use this heuristic:

1) If it doesn't have ://, it is a local path. Convert it into a file:///
   URL by escaping everything except A-Za-z0-9_.-/ (and on Windows
   convert \ => /, and handle C: properly).

2) If it has ://, consider it to be a hybrid URL. That is, a URL which
   may have some escaping, but might also have some Unicode, or other
   non-URL characters.
   Convert hybrid -> normalized URL using:
     - encode non-ASCII characters with utf8 + % escaping.
     - % escape all other characters, except A-Za-z0-9_.-/:?&=;,#$

3) Saved URLs will always be saved as normalized URLs, so should not be
   re-normalized.
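
A minimal sketch of rules 1 and 2 above, written for Python 3 (where str
plays the role of the unicode type discussed in this thread). The names
normalize_location and _escape are hypothetical, not bzrlib functions. Two
assumptions go beyond the text: '%' is left alone in hybrid URLs so that
existing escapes are not doubled, and local paths are assumed to be
absolute POSIX paths (the Windows handling is left out).

```python
import string

# Characters never escaped in a local path (rule 1).
_PATH_SAFE = string.ascii_letters + string.digits + "_.-/"
# Characters never escaped in a hybrid URL (rule 2).
# Assumption: '%' is added so already-escaped sequences are not doubled.
_URL_SAFE = _PATH_SAFE + ":?&=;,#$" + "%"


def _escape(text, safe):
    """Percent-escape every character not in `safe`, using its UTF-8 bytes."""
    out = []
    for ch in text:
        if ch in safe:
            out.append(ch)
        else:
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


def normalize_location(location):
    """Turn user input (local path or hybrid URL) into a normalized URL."""
    if "://" in location:
        # Rule 2: hybrid URL, escape only what is not URL-safe.
        return _escape(location, _URL_SAFE)
    # Rule 1: local path. Assumes an absolute POSIX path such as /tmp/x.
    return "file://" + _escape(location, _PATH_SAFE)


print(normalize_location("http://www.ucw.cz/~bulb/{archives}/"))
print(normalize_location("/tmp/my branch"))
```

Under these assumptions an already-normalized URL passes through
unchanged, which fits rule 3: re-normalizing a saved URL would be
harmless, but is also unnecessary.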

get_transport(path) can further use this heuristic:

  if isinstance(path, unicode):
      path is 1 or 2
  elif isinstance(path, str):
      path is 1 or 3

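
The type-based dispatch could be sketched like this in Python 3 terms,
where str corresponds to the unicode type above and bytes to the plain
str. classify_location is a hypothetical helper, not part of
get_transport, and using the presence of :// to separate case 1 from
cases 2 and 3 is an assumption carried over from the heuristic.

```python
def classify_location(path):
    """Return which of the three cases a location falls into."""
    if isinstance(path, str):
        # Text input (user-typed): local path or hybrid URL.
        return "hybrid-url" if "://" in path else "local-path"
    if isinstance(path, bytes):
        # Byte input (e.g. read back from disk): local path or saved URL.
        return "normalized-url" if b"://" in path else "local-path"
    raise TypeError("expected str or bytes, got %r" % type(path))


print(classify_location("http://example.com/caf\u00e9"))
print(classify_location(b"http://example.com/caf%C3%A9"))
```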
Functions like Branch.get_parent() should realize that they always
return URLs, and thus should always return a plain str, not a unicode
string (which is what naively reading .bzr/branch/parent as a utf8 file
would produce).

Does that sound complete?
John
=:->