[MERGE] push handles file-ids containing quotes correctly

Tue Jul 11 00:10:46 BST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>>>> Done, since you insist.  I still think it's pointless to test the
>>>>> unescaper, since if the XML text is not a valid inventory (even if it's
>>>>> well-formed XML), file_ids_affected_by is probably broken.
>>>>>
>>>>> Aaron
>>>
>>> Well, it may be valid escaped XML.
> 
> Valid escaped XML doesn't make it a valid inventory.  Our inventories
> are an XML subset that has line breaks after each entry, no comments, no
> CDATA, no processing instructions, no numeric entity references...
> 
>>> We obviously missed &apos;, so it
>>> seems possible that we are missing others.
> 
> Well, yeah.  Anything that's not ASCII is going to be serialized as
> numeric entity references by ElementTree, because ElementTree defaults
> to ASCII, not utf-8.  And we don't decode numeric entity references.
> 
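
To illustrate the ElementTree behaviour described above, here is a minimal
sketch. It uses the xml.etree.ElementTree module from a modern Python 3
standard library rather than the ElementTree release bzr shipped with in
2006, and the "entry" element and "file_id" attribute are only stand-ins
for a real inventory entry.

```python
import xml.etree.ElementTree as ET

# Stand-in for an inventory entry whose file-id contains a non-ASCII character.
entry = ET.Element('entry', {'file_id': 'file-\u1234-id'})

# The default output encoding is US-ASCII, so the non-ASCII character is
# written out as a decimal numeric character reference.
ascii_xml = ET.tostring(entry)

# Asking for utf-8 writes the character as raw UTF-8 bytes instead.
utf8_xml = ET.tostring(entry, encoding='utf-8')

print(ascii_xml)
print(utf8_xml)
```

0x1234 is 4660 in decimal, so the ASCII output carries "&#4660;" where the
utf-8 output carries the encoded character itself.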

Well, it wouldn't be that hard to change the unescaper to try int() on
the returned value. Or even change the regex to:

r'&(#(?P<numeric>\d+)|(?P<text>[^;]*));'
and then have the unescaper do something like:

# 'numeric' holds the digits of a reference such as &#4660;
numeric = match.group('numeric')
if numeric is not None:
  return unichr(int(numeric))
else:
  return _map[match.group('text')]

That assumes that numeric references match the Python unicode code point,
which they should: XML character references name ISO/IEC 10646 (Unicode)
code points, written in decimal for the &#NNNN; form.
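
Put together, the idea could look roughly like the sketch below. It is
written for Python 3 (chr() instead of unichr()), and the names _entity_re,
_map, _unescaper and unescape are hypothetical, not the ones in bzrlib. It
handles only decimal references and the five predefined XML entities;
hexadecimal references (&#x...;) are left untouched, and an unknown named
entity raises KeyError.

```python
import re

# The five entities predefined by XML (hypothetical table name).
_map = {'apos': "'", 'quot': '"', 'amp': '&', 'lt': '<', 'gt': '>'}

# Either a decimal character reference (&#NNNN;) or a named entity (&name;).
_entity_re = re.compile(r'&(?:#(?P<numeric>\d+)|(?P<text>[^;#]*));')


def _unescaper(match):
    """Return the character that one matched entity stands for."""
    numeric = match.group('numeric')
    if numeric is not None:
        # Decimal references give the Unicode code point directly.
        return chr(int(numeric))
    return _map[match.group('text')]


def unescape(text):
    """Replace every entity in text in a single left-to-right pass."""
    return _entity_re.sub(_unescaper, text)


print(unescape('it&apos;s-&#4660;-&amp;lt;'))
```

Because the substitution is a single pass, "&amp;lt;" comes out as the
literal text "&lt;" and is not unescaped a second time.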

>>> Which we won't find until a
>>> bug surfaces again. And I'd rather it surface early rather than later.
>>>
>>> I don't know of anything we are missing. But I know that as of right
>>> now, we don't have a lot of testing for extended unicode file ids.
> 
> Oh, you know that thing Robert says that "untested code is broken code"?
>  We can't even commit unicode file-ids.
> 

...

>     return "%02x/" % (adler32(fileid) & 0xff)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in
> position 18: ordinal not in range(128)
> 
> ----------------------------------------------------------------------
> Ran 4 tests in 0.364s
> 

Sure. adler32() for hash prefixes only works on bytestrings, so we'd need
to encode the file-id as utf-8 first, or somesuch.
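
Something along these lines, as a sketch only: hash_prefix is a hypothetical
stand-alone function, not the actual bzrlib store method, and it is written
for Python 3, where zlib.adler32() accepts bytes only.

```python
from zlib import adler32


def hash_prefix(fileid):
    """Return a two-hex-digit directory prefix for a file-id."""
    if isinstance(fileid, str):
        # adler32() needs bytes, so encode unicode file-ids as utf-8 first.
        fileid = fileid.encode('utf-8')
    return "%02x/" % (adler32(fileid) & 0xff)


print(hash_prefix('file-\u1234-id'))
```

A unicode file-id and its utf-8 encoded form then land in the same prefix
directory.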

Certainly I found in my encoding work that saying you support unicode
and actually supporting it are a little bit different. :)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEst51JdeBCYSNAAMRAhU3AJ9wKf6M/esa8IXGQZsGvBo4XZ84YwCg1Eql
d6Mg0kOtR7tnQuliI4IUoLg=
=iKGV
-----END PGP SIGNATURE-----