[MERGE] Fix handling of unicode paths in smart server requests
John Arbash Meinel
john at arbash-meinel.com
Thu Oct 19 17:44:59 BST 2006
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Andrew Bennetts wrote:
> Andrew Bennetts wrote:
> [...]
>> I think I've probably solved the wrong problem; as you say, url-quoting needs to
>> happen somewhere. The existing situation where unicode objects are passed as
>> relpaths to transports isn't right, though.
>
> Here's a bundle of a much simpler change: this doesn't try to UTF-8 decode every
> element of the request/response tuples. Paths sent over the wire should already
> be url-quoted, and thus already ASCII. Transport methods seem to expect
> relpaths as url-quoted bytestrings (i.e. type str not type unicode), e.g.
> Transport._combine_paths checks for this[1]. So it seems natural to normally work
> with tuples of bytestrings rather than unicode. This avoids the encoding issue
> I had, and it avoids unnecessary decoding and encoding, which is probably a
> small performance win.
>
> This is pretty small, but I'm not sure if it's small enough for it to sneak into
> 0.12 or not. It's a prerequisite for the WSGI branch, though.
>
> -Andrew.
>
> [1] It would be nice if the type of path objects Transport methods expect were
> clearly documented in those methods' docstrings. At the moment I'm figuring
> this out by browsing code, so it's possible I'm misunderstanding the
> behaviour and/or the intent. There are too many combinations of
> quoted/not-quoted, type-is-str/type-is-unicode to make it immedately
> obvious.
>
The joys of an evolving interface.
The official definition is that all paths that a Transport deals with
are urls or url fragments. This includes the path passed into the
initializer. Pre-0.9 things were commonly unicode, but we tried to unify
everything. So if you see places that are treating things as unicode,
they shouldn't be. Everything should be a url (so utf-8 encoded and url
quoted, basically run through urlutils.escape()).
...
> # message:
> # Don't UTF-8 decode paths in requests. They should be url-quoted (and thus
> # ASCII) already.
> #
> # Instead, change _encode_tuple and _decode_tuple to work with tuples of
> # bytestrings rather than unicode objects.
This seems like a fairly large conceptual change to the the smart server
api. Though it seems the correct one to make, it seemed to me that all
the old smart server code decoded things for you.
v- Robert has remarked that we tend to test the code once, and then test
it a second time with unicode. He proposed that we just test it the
first time with a unicode string, so it might be better to put this test
into the other one.
Also, I would like to see it make sure things are url escaped, so put a
' ' or a '+' or some other ASCII character that gets escaped.
> + def test_response_to_canned_get_of_utf8(self):
> + # wire-to-wire, using the whole stack, with a UTF-8 filename.
> + transport = memory.MemoryTransport('memory:///')
> + utf8_filename = u'testfile\N{INTERROBANG}'.encode('utf-8')
> + transport.put_bytes(utf8_filename, 'contents\nof\nfile\n')
> + to_server = StringIO('get\001' + utf8_filename + '\n')
> + from_server = StringIO()
> + server = smart.SmartServerPipeStreamMedium(
> + to_server, from_server, transport)
> + protocol = smart.SmartServerRequestProtocolOne(transport,
> + from_server.write)
> + server._serve_one_request(protocol)
> + self.assertEqual('ok\n'
> + '17\n'
> + 'contents\nof\nfile\n'
> + 'done\n',
> + from_server.getvalue())
> +
v- This is a little unfortunate, but we needed a way to pass back data
to the source. It might be better to put this in the 'bulk data'
section, but the way the error handler works, it wasn't possible.
> # with a plain string
> str_or_unicode = e.object
> if isinstance(str_or_unicode, unicode):
> - val = u'u:' + str_or_unicode
> + # XXX: UTF-8 might have \x01 (our seperator byte) in it. We
> + # should escape it somehow.
> + val = 'u:' + str_or_unicode.encode('utf-8')
> else:
> - val = u's:' + str_or_unicode.encode('base64')
> + val = 's:' + str_or_unicode.encode('base64')
> # This handles UnicodeEncodeError or UnicodeDecodeError
> return SmartServerResponse((e.__class__.__name__,
> e.encoding, val, str(e.start), str(e.end), e.reason))
> @@ -1218,7 +1213,7 @@
> end = int(resp[4])
> reason = str(resp[5]) # reason must always be a string
> if val.startswith('u:'):
> - val = val[2:]
> + val = val[2:].decode('utf-8')
> elif val.startswith('s:'):
> val = val[2:].decode('base64')
> if what == 'UnicodeDecodeError':
>
The changes seem correct to me. I wasn't sure if '\x01' was a valid
utf-8 character, but '\x01'.decode('utf-8') == u'\x01'. So it seems that
there is a Unicode point for '\x01'. I wonder if it has any meaning.
It would be another reason that stuff like this should go into the body,
rather than being in the args section...
Anyway, +1. As near as I can tell, the bytes on the wire haven't changed
after this, which means it hasn't changed compatibility. I don't think
WSGI made it to 0.12, though (which doesn't really matter, since it is
the server side of things, not the client, so if you want to install an
0.13 pre-release it should still be compatible with 0.12 clients).
So we may just merge this in when we merge the other smart changes in
early 0.13. But I would be okay with it coming in earlier if you have a
specific need for it.
Note: if I'm misunderstanding, I don't really want to change the
bytes-on-the-wire at this point. So if there are changes, then this
needs to wait.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFFN6uLJdeBCYSNAAMRAoEDAKDG8cBk6XFL+jcrPsfMav4CvC/0IgCgs/Ek
TSsx7hNPuznk/TSCl5eNtvE=
=g6zg
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list