[MERGE] push handles file-ids containing quotes correctly
Jan Hudec
bulb at ucw.cz
Tue Jul 11 08:04:42 BST 2006
On Mon, Jul 10, 2006 at 18:10:46 -0500, John Arbash Meinel wrote:
> Aaron Bentley wrote:
> > John Arbash Meinel wrote:
> >>>>> Done, since you insist. I still think it's pointless to test the
> >>>>> unescaper, since if the XML text is not a valid inventory, (even if it's
> >>>>> well-formed XML) file_ids_affected_by is probably broken.
> >>>>>
> >>>>> Aaron
> >>>
> >>> Well, it may be valid escaped XML.
> >
> > Valid escaped XML doesn't make it a valid inventory. Our inventories
> > are an XML subset that has line breaks after each entry, no comments, no
> > CDATA, no processing instructions, no numeric entity references...
> >
> >>> We obviously missed ', so it
> >>> seems possible that we are missing others.
> >
> > Well, yeah. Anything that's not ASCII is going to be serialized as
> > numeric entity references by ElementTree, because ElementTree defaults
> > to ASCII, not utf-8. And we don't decode numeric entity references.
> >
>
> Well, it wouldn't be that hard to change the unescaper to try int() on
> the returned value. Or even change the regex to:
>
> r'&((?P<numeric>\d+)|(P<text>[^;]*));'
> and then have the unescaper do something like:
>
> numeric = match.group('numeric')
> if numeric is not None:
>     return unichr(numeric)
> else:
>     return _map[match.group('text')]
>
> That assumes that numeric references match the python unicode codepoint,
> but I would guess that they do.
Pardon me, but I think the regexp for numeric entity references is wrong:
it omits the '#'. Decimal numeric references match r'&#\d+;'.
I would probably use the lastindex trick and do:
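import re

# Group 1 is the decimal code point; groups 2-6 are the named entities,
# in the same order as the characters in _unescape_list.
_unescape_re = re.compile(r'&(?:#(\d+)|(amp)|(gt)|(lt)|(apos)|(quot));')
_unescape_list = u"&><'\""

def _unescaper(match):
    if match.lastindex == 1:
        # Numeric reference: convert the code point to a character.
        return unichr(int(match.group(1)))
    else:
        return _unescape_list[match.lastindex - 2]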
I am using group indices instead of the matched text because an integer
comparison and a sequence lookup should be slightly faster than a
dictionary lookup on the text (the matched strings are not interned).
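
For illustration, here is a minimal, self-contained sketch of the same
lastindex approach written for Python 3, where chr() replaces unichr().
The unescape() wrapper and the sample file id are hypothetical names used
only for this example; they are not taken from bzrlib. It also shows the
behaviour mentioned in the quoted text: ElementTree's default ASCII
serialization turns a non-ASCII character into a decimal numeric
reference, which the unescaper then decodes.

```python
import re
import xml.etree.ElementTree as ET

# Group 1 captures a decimal code point; groups 2-6 are the five
# predefined XML entities, in the same order as _unescape_list.
_unescape_re = re.compile(r'&(?:#(\d+)|(amp)|(gt)|(lt)|(apos)|(quot));')
_unescape_list = "&><'\""


def _unescaper(match):
    # Exactly one group takes part in each match, so lastindex
    # identifies which alternative was matched.
    if match.lastindex == 1:
        return chr(int(match.group(1)))
    return _unescape_list[match.lastindex - 2]


def unescape(text):
    """Hypothetical helper: decode the references in a single pass."""
    return _unescape_re.sub(_unescaper, text)


# ElementTree serializes to ASCII by default, so the non-ASCII
# character in this made-up file id becomes a numeric reference.
element = ET.Element("file", file_id="caf\u00e9-1")
serialized = ET.tostring(element).decode("ascii")
print(serialized)
print(unescape(serialized))
```

Because the substitution is a single pass, text such as "&amp;lt;" is
decoded once to "&lt;" and not a second time to "<". References that the
pattern does not cover, such as the hexadecimal form, are left untouched.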
--
Jan 'Bulb' Hudec <bulb at ucw.cz>