[MERGE] push handles file-ids containing quotes correctly
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 11 00:10:46 BST 2006
Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>>>> Done, since you insist. I still think it's pointless to test the
>>>>> unescaper, since if the XML text is not a valid inventory (even if it's
>>>>> well-formed XML), file_ids_affected_by is probably broken.
>>>>>
>>>>> Aaron
>>>
>>> Well, it may be valid escaped XML.
>
> Valid escaped XML doesn't make it a valid inventory. Our inventories
> are an XML subset that has line breaks after each entry, no comments, no
> CDATA, no processing instructions, no numeric entity references...
>
>>> We obviously missed ', so it
>>> seems possible that we are missing others.
>
> Well, yeah. Anything that's not ASCII is going to be serialized as
> numeric entity references by ElementTree, because ElementTree defaults
> to ASCII, not utf-8. And we don't decode numeric entity references.
>
Well, it wouldn't be that hard to change the unescaper to try int() on
the returned value. Or even change the regex to:

    r'&(#(?P<numeric>\d+)|(?P<text>[^;]*));'

and then have the unescaper do something like:

    numeric = match.group('numeric')
    if numeric is not None:
        # group() returns a string, so convert it before calling unichr()
        return unichr(int(numeric))
    else:
        return _map[match.group('text')]

That assumes that numeric references match the Python unicode code point,
but I would guess that they do.
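
As a rough, self-contained sketch of that idea (not the actual bzrlib unescaper): the version below is written for Python 3, so it uses chr() where the snippet above uses unichr(). The names _NAMED, _ENTITY_RE and unescape are hypothetical stand-ins, and it assumes only decimal references of the form &#NNNN; need handling.

```python
import re

# Hypothetical stand-in for the _map of named entities mentioned above.
_NAMED = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'"}

# Either a decimal numeric reference (&#1234;) or a named entity (&apos;).
_ENTITY_RE = re.compile(r'&(?:#(?P<numeric>\d+)|(?P<text>[^;#]*));')


def _unescaper(match):
    numeric = match.group('numeric')
    if numeric is not None:
        # Numeric references carry the code point as a decimal string.
        return chr(int(numeric))
    # Unknown names raise KeyError, so a missed entity surfaces early.
    return _NAMED[match.group('text')]


def unescape(text):
    """Replace named and decimal numeric entity references in text."""
    return _ENTITY_RE.sub(_unescaper, text)
```

Letting an unknown entity name raise KeyError, rather than passing it through, is a deliberate choice here: a missed case like the quote one shows up in a test instead of in a later bug report.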
>>> Which we won't find until a
>>> bug surfaces again. And I'd rather it surface early rather than later.
>>>
>>> I don't know of anything we are missing. But I know that as of right
>>> now, we don't have a lot of testing for extended unicode file ids.
>
> Oh, you know that thing Robert says that "untested code is broken code"?
> We can't even commit unicode file-ids.
>
...
> return "%02x/" % (adler32(fileid) & 0xff)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in
> position 18: ordinal not in range(128)
>
> ----------------------------------------------------------------------
> Ran 4 tests in 0.364s
>
Sure. adler32() for hash prefixes only works on byte strings, so
we'd need to encode the file id as UTF-8 first, or some such.
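
Something along these lines is what I have in mind, as a minimal sketch only. It is written for Python 3, the function name hash_prefix is made up for illustration, and whether UTF-8 is the right encoding for file ids on disk is an assumption, not a settled decision. The return expression is the one from the traceback above.

```python
from zlib import adler32


def hash_prefix(fileid):
    """Return a two-hex-digit directory prefix for a file id.

    Hypothetical helper: text file ids are encoded as UTF-8 first
    (an assumption), because adler32() only accepts bytes.
    """
    if isinstance(fileid, str):
        fileid = fileid.encode('utf-8')
    return "%02x/" % (adler32(fileid) & 0xff)
```

With that, a file id containing u'\u1234' hashes to the same prefix whether it is passed as text or as already-encoded UTF-8 bytes.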
Certainly I found in my encoding work that saying you support Unicode
and actually supporting it are a little bit different. :)
John
=:->