"Using Saved Location: foo"

Jan Hudec bulb at ucw.cz
Wed May 3 20:04:44 BST 2006


On Wed, May 03, 2006 at 09:17:26 -0500, John Arbash Meinel wrote:
> Jan Hudec wrote:
> 
> ...
> 
> >> I just wanted to mention that the isinstance() check will always return
> >> the same thing for every character. So really you only need to do:
> >>
> >> if not isinstance(psuedo_url, unicode):
> >>   return psuedo_url
> >>
> >> r = []
> >> for c in psuedo_url:
> >>   if c in url_safe_characters:
> >>     r.append(c)
> >>   else:
> >>     r.append(urlescape(c)) # urlescape does utf-8 encoding :)
> >> return ''.join(r)
> >>
> >> You have to be a little bit careful, since usually ":" is not url safe,
> >> but it will occur in all fully qualified urls.
> >> (Urllib quotes everything that isn't A-Za-z0-9_.-)
> > 
> > Bzr must *NOT* quote any of the special characters -- %/:?&=;,#$
> > Otherwise there would be no way to enter their unescaped form.
> > (note: according to the RFC $ is special, though I don't really have any idea
> > when it's used)
> > 
> >> Also, the 'isinstance(...,unicode)' isn't really useful, because all
> >> user input comes in as unicode. Which is why I was proposing to do:
> >>
> >> try:
> >>   return psuedo_url.encode('ascii')
> >> except UnicodeEncodeError:
> >>   pass
> >>
> >> That should work for a large portion of URLs, and then we don't have to
> >> do all of the per-character checking.
> > 
> > It's .encode('ascii') that won't do. This needs quoting as well:
> > 
> > http://www.ucw.cz/~bulb/{archives}/
> > 
> > Neither '~', '{' nor '}' are allowed literally.
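
A minimal sketch of the point above, written for Python 3 (an assumption;
the thread itself uses Python 2). The URL_SAFE set and the needs_quoting
helper are hypothetical names for illustration: the set is simply the
unreserved characters plus the specials listed earlier in this thread.
The ASCII encode succeeds, yet three characters still fall outside the
safe set.

```python
import string

# Assumed safe set: A-Za-z0-9_.- plus the special characters %/:?&=;,#$
URL_SAFE = set(string.ascii_letters + string.digits + "_.-" + "%/:?&=;,#$")


def needs_quoting(text):
    """Return the sorted list of characters in text that are outside URL_SAFE."""
    return sorted(set(text) - URL_SAFE)


url = "http://www.ucw.cz/~bulb/{archives}/"
url.encode("ascii")            # succeeds: every character is plain ASCII
leftover = needs_quoting(url)  # but these would still need % escaping
```

So an ASCII-only check cannot serve as proof that a string is already a
valid URL.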
> 
> Well, I just did a test on my webserver, and it does allow {}. (I tested
> both with urllib, and manually telnet + GET)
> 
> However, that could just be that Apache is generous with what it
> accepts. Heck, it even accepted a UTF-8 string in the GET request:
> GET /pub/جوجو.html HTTP/1.0
> Host: john.arbash-meinel.com
> 
> Worked just fine. And in fact:
> urllib.urlopen(u'http://john.arbash-meinel.com/pub/جوجو.html'
>                 .encode('utf8')).read()
> 
> worked as well.
> I'm not saying it is following any sort of standard. And it is better to
> follow the standard, than exploit whatever freedom a particular server
> offers. But I found it interesting.
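
For comparison, a short sketch of what the standards-following form of
that request path would look like. It assumes Python 3, where
urllib.parse.quote encodes non-ASCII text as UTF-8 before % escaping and
leaves "/" alone by default.

```python
from urllib.parse import quote

path = "/pub/جوجو.html"

# Each Arabic letter becomes two UTF-8 bytes, each written as %XX.
escaped = quote(path)  # '/pub/%D8%AC%D9%88%D8%AC%D9%88.html'
```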
> 
> > 
> > The point of isinstance(...,unicode) is, that user input is always
> > unicode while the encoded form is not. User input always needs exactly
> > one round of encoding.
> > 
> 
> You make a good point. There are characters that are valid ASCII that
> still aren't valid in URLs. I was concerned about double escaping the %
> characters, but we certainly could use this heuristic:
> 
> 1) If it doesn't have :// it is a local path. Convert it into a file:///
>    url by escaping everything except A-Za-z0-9_.-/ (and on windows
>    convert \ => /, and handle C: properly)
> 2) If it has ://, consider it to be a hybrid URL. That is, a URL which
>    may have some escaping, but might also have some Unicode, or other
>    non URL characters.
>    Convert hybrid->normalized URL using:
> 	encode non-ascii characters with utf8 + % escaping.
> 	% escape all other characters, except A-Za-z0-9_.-/:?&=;,#$
> 
> 3) Saved URLs will always be saved as normalized URLs, so should not be
> re-normalized.
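
Cases 1 and 2 above could be sketched roughly as follows. This is
Python 3 and only an illustration: percent_escape and normalize_location
are hypothetical names, the local path is assumed to be absolute and
POSIX-style (no Windows handling), and "%" is kept in the hybrid URL
safe set on the assumption that existing escapes must not be escaped a
second time.

```python
import string

# Characters left untouched in a local path (case 1).
PATH_SAFE = string.ascii_letters + string.digits + "_.-/"
# Characters left untouched in a hybrid URL (case 2). Keeping "%" is an
# assumption, so that escapes already present are not double escaped.
URL_SAFE = PATH_SAFE + ":?&=;,#$" + "%"


def percent_escape(text, safe):
    """Escape every character not in `safe` as its UTF-8 bytes in %XX form."""
    out = []
    for ch in text:
        if ch in safe:
            out.append(ch)
        else:
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


def normalize_location(location):
    """Turn user input (a local path or a hybrid URL) into a normalized URL."""
    if "://" not in location:
        # Case 1: local path, assumed absolute and POSIX-style here.
        return "file://" + percent_escape(location, PATH_SAFE)
    # Case 2: hybrid URL, possibly partly escaped already.
    return percent_escape(location, URL_SAFE)
```

With this sketch, "/tmp/my branch" becomes "file:///tmp/my%20branch", and
"http://www.ucw.cz/~bulb/{archives}/" becomes
"http://www.ucw.cz/%7Ebulb/%7Barchives%7D/".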
> 
> 
> get_transport(path) can further use this heuristic:
> if isinstance(path, unicode):
> 	path is case 1 or 2
> elif isinstance(path, str):
> 	path is case 1 or 3
> 
> Functions like Branch.get_parent() should realize that they always
> return URLs and thus should always return a plain str(), not a unicode
> string (which is what naively reading .bzr/branch/parent as a utf8 file
> would produce).
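
One way to picture this, as a Python 3 sketch rather than bzr's actual
code: read the saved location as bytes and only check that it is pure
ASCII, which a normalized URL always is. read_parent_url is a
hypothetical name, and the file layout (one URL plus a trailing newline)
is an assumption.

```python
def read_parent_url(parent_file):
    """Read a saved, already normalized URL and return it as a byte string."""
    with open(parent_file, "rb") as f:
        raw = f.read().strip()
    # A normalized URL is pure ASCII; raise UnicodeDecodeError otherwise
    # instead of silently handing back something that needs re-escaping.
    raw.decode("ascii")
    return raw
```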
> 
> Does that sound complete?

To me it sounds complete and sound.

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>