"Using Saved Location: foo"

John Arbash Meinel john at arbash-meinel.com
Wed May 3 15:17:26 BST 2006


Jan Hudec wrote:

...

>> I just wanted to mention that the isinstance() check will always return
>> the same thing for every character. So really you only need to do:
>>
>> if not isinstance(psuedo_url, unicode):
>>   return psuedo_url
>>
>> r = []
>> for c in psuedo_url:
>>   if c in url_safe_characters:
>>     r.append(c)
>>   else:
>>     r.append(urlescape(c)) # urlescape does utf-8 encoding :)
>> return ''.join(r)
>>
>> You have to be a little bit careful, since usually ":" is not url safe,
>> but it will occur in all fully qualified urls.
>> (Urllib quotes everything that isn't A-Za-z0-9_.-)
> 
> Bzr must *NOT* quote any of the special characters -- %/:?&=;,#$
> Otherwise there would be no way to enter their unescaped form.
> (note: according to the RFC, $ is special, though I don't really have
> any idea when it's used)
> 
>> Also, the 'isinstance(...,unicode)' isn't really useful, because all
>> user input comes in as unicode. Which is why I was proposing to do:
>>
>> try:
>>   return psuedo_url.encode('ascii')
>> except UnicodeEncodeError:
>>   pass
>>
>> That should work for a large portion of URLs, and then we don't have to
>> do all of the per-character checking.
> 
> It's .encode('ascii') that won't do. This needs quoting as well:
> 
> http://www.ucw.cz/~bulb/{archives}/
> 
> Neither '~', '{' nor '}' are allowed literally.

Well, I just did a test on my webserver, and it does allow {}. (I tested
both with urllib, and manually telnet + GET)

However, that could just be that Apache is generous with what it
accepts. Heck, it even accepted a UTF-8 string in the GET request:
GET /pub/جوجو.html HTTP/1.0
Host: john.arbash-meinel.com

Worked just fine. And in fact:
urllib.urlopen(u'http://john.arbash-meinel.com/pub/جوجو.html'
                .encode('utf8')).read()

worked as well.
I'm not saying it is following any sort of standard. And it is better to
follow the standard than to exploit whatever freedom a particular server
offers. But I found it interesting.

> 
> The point of isinstance(...,unicode) is, that user input is always
> unicode while the encoded form is not. User input always needs exactly
> one round of encoding.
> 

You make a good point. There are characters that are valid ASCII that
still aren't valid in URLs. I was concerned about double escaping the %
characters, but we certainly could use this heuristic:

1) If it doesn't have :// it is a local path. Convert it into a file:///
   url by escaping everything except A-Za-z0-9_.-/ (and on windows
   convert \ => /, and handle C: properly)
2) If it has ://, consider it to be a hybrid URL. That is, a URL which
   may have some escaping, but might also have some Unicode, or other
   non URL characters.
   Convert hybrid->normalized URL using:
     - encode non-ascii characters with utf8 + % escaping.
     - % escape all other characters, except A-Za-z0-9_.-/:?&=;,#$
3) Saved URLs will always be saved as normalized URLs, so should not be
   re-normalized.
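
Cases 1 and 2 above could be sketched roughly as follows. This is only an
illustration, not bzr code: the names percent_escape and
normalize_user_input are made up here, it is written for Python 3 (where
str is already Unicode), and it assumes that '%' is left alone in hybrid
URLs so that existing %XX escapes are not escaped a second time. Windows
path handling is left out.

```python
import string

# Characters that are never escaped: A-Za-z0-9_.-
ALWAYS_SAFE = string.ascii_letters + string.digits + "_.-"
# Extra characters kept in a hybrid URL: the list from the message, plus '%'
# (an assumption) so that existing %XX escapes are not escaped again.
HYBRID_SAFE = "/:?&=;,#$" + "%"


def percent_escape(text, safe):
    """Escape every character not in ALWAYS_SAFE or `safe` as UTF-8 %XX bytes."""
    out = []
    for ch in text:
        if ch in ALWAYS_SAFE or ch in safe:
            out.append(ch)
        else:
            # Non-ASCII characters become one %XX per UTF-8 byte.
            out.extend("%%%02X" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


def normalize_user_input(text):
    """Turn user input (case 1 or 2) into a normalized URL."""
    if "://" not in text:
        # Case 1: local path. Assumes an absolute POSIX path; Windows
        # handling (\ => /, drive letters) is not covered by this sketch.
        return "file://" + percent_escape(text, safe="/")
    # Case 2: hybrid URL, possibly already partly escaped.
    return percent_escape(text, safe=HYBRID_SAFE)
```

With this sketch, Jan's example http://www.ucw.cz/~bulb/{archives}/ comes
out as http://www.ucw.cz/%7Ebulb/%7Barchives%7D/, and an input that
already contains %20 is left unchanged.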


get_transport(path) can further use this heuristic:
if isinstance(path, unicode):
	path is case 1 or 2 (user input)
elif isinstance(path, str):
	path is case 1 or 3 (local path or saved URL)
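
Spelled out, the type check plus the :// test gives three outcomes. The
sketch below is a hypothetical helper (classify is not a bzr function) and
assumes Python 3, where the unicode/str split of the heuristic above
corresponds to str/bytes:

```python
def classify(path):
    """Return which of the three cases a path falls into."""
    if isinstance(path, str):
        # Unicode text: user input, so case 1 or 2.
        return "hybrid URL" if "://" in path else "local path"
    if isinstance(path, bytes):
        # Plain byte string: case 1 or 3; saved URLs are already normalized.
        return "normalized URL" if b"://" in path else "local path"
    raise TypeError("path must be str or bytes")
```

Only the "hybrid URL" and "local path" results would then need a round of
escaping; a "normalized URL" is passed through untouched.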

Functions like Branch.get_parent() should realize that they always
return URLs and thus should always return a plain str(), not a unicode
string. (Naively reading .bzr/branch/parent as a utf8 file would return
a unicode string.)

Does that sound complete?

John
=:->
