[MERGE] push handles file-ids containing quotes correctly

Jan Hudec bulb at ucw.cz
Tue Jul 11 08:04:42 BST 2006


On Mon, Jul 10, 2006 at 18:10:46 -0500, John Arbash Meinel wrote:
> Aaron Bentley wrote:
> > John Arbash Meinel wrote:
> >>>>> Done, since you insist.  I still think it's pointless to test the
> >>>>> unescaper, since if the XML text is not a valid inventory, (even if it's
> >>>>> well-formed XML) file_ids_affected_by is probably broken.
> >>>>>
> >>>>> Aaron
> >>>
> >>> Well, it may be valid escaped XML.
> > 
> > Valid escaped XML doesn't make it a valid inventory.  Our inventories
> > are an XML subset that has line breaks after each entry, no comments, no
> > CDATA, no processing instructions, no numeric entity references...
> > 
> >>> We obviously missed ', so it
> >>> seem possible that we are missing others. 
> > 
> > Well, yeah.  Anything that's not ASCII is going to be serialized as
> > numeric entity references by ElementTree, because ElementTree defaults
> > to ASCII, not utf-8.  And we don't decode numeric entity references.
> > 
> 
> Well, it wouldn't be that hard to change the unescaper to try int() on
> the returned value. Or even change the regex to:
> 
> r'&((?P<numeric>\d+)|(P<text>[^;]*));'
> and then have the unescaper do something like:
> 
> numeric = match.group('numeric')
> if numeric is not None:
>   return unichr(numeric)
> else:
>   return _map[match.group('text')]
> 
> That assumes that numeric references match the python unicode codepoint,
> but I would guess that they do.

Pardon me, but I think the regexp for numeric entity references is wrong.
Numeric references have a '#' after the ampersand; the decimal form
matches r'&#\d+;'.
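
A quick check of the difference, as a sketch only. The names quoted_re and
numeric_re are mine, and the first pattern is the quoted one with the missing
"?" in (?P<text>...) restored so that it compiles:

```python
import re

# Quoted proposal, with the "?" restored in (?P<text>...) so it compiles.
quoted_re = re.compile(r'&((?P<numeric>\d+)|(?P<text>[^;]*));')
# Pattern for decimal numeric character references.
numeric_re = re.compile(r'&#(\d+);')

m = quoted_re.match('&#233;')
# The "#" keeps the numeric branch from matching, so the reference
# falls through to the catch-all text branch instead.
print(m.group('numeric'), m.group('text'))

# With the "#" in the pattern, the digits are captured directly.
print(numeric_re.match('&#233;').group(1))
```

So with the quoted pattern a numeric reference would end up as a lookup of
'#233' in _map rather than in the numeric branch.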

I would probably use the lastindex trick and do:

import re

# Group 1 is the numeric reference; groups 2-6 follow the order of
# the characters in _unescape_list.
_unescape_re = re.compile(r'&(?:#(\d+)|(amp)|(gt)|(lt)|(apos)|(quot));')
_unescape_list = u"&><'\""

def _unescaper(match):
	if match.lastindex == 1:
		return unichr(int(match.group(1)))
	else:
		return _unescape_list[match.lastindex - 2]

I am using group indices instead of the matched text because an integer
comparison and a sequence lookup should be slightly faster than comparing
or looking up the matched strings (those strings are not interned).
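
For anyone who wants to try the lastindex approach, here is a minimal
self-contained sketch. It assumes Python 3, so it uses chr() where the snippet
above uses unichr(), and the unescape() wrapper is a hypothetical helper of
mine, not anything in bzrlib:

```python
import re

# Group 1 is the decimal reference; groups 2-6 are the named entities,
# in the same order as the characters in _unescape_list.
_unescape_re = re.compile(r'&(?:#(\d+)|(amp)|(gt)|(lt)|(apos)|(quot));')
_unescape_list = "&><'\""

def _unescaper(match):
    # Exactly one capturing group takes part in any match, so lastindex
    # is the number of that group.
    if match.lastindex == 1:
        return chr(int(match.group(1)))
    return _unescape_list[match.lastindex - 2]

def unescape(text):
    """Hypothetical wrapper: replace every recognized reference in text."""
    return _unescape_re.sub(_unescaper, text)

print(unescape('a&amp;b &lt;x&gt; &apos;q&quot; &#233;'))
```

References the pattern does not know about (hexadecimal ones, or other named
entities) are left in the text unchanged, because the regexp simply does not
match them.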

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>