newformat format change

Thu Oct 6 15:52:00 BST 2005

Martin Pool wrote:
> On 05/10/05, John A Meinel <john at arbash-meinel.com> wrote:
> 
> 
>>One problem with the current trapping is the regular expression used to
>>strip everything that isn't a word character:
>>name = re.sub(r'[^\w.]', '', name)
>>
>>I believe it will catch newlines and tabs, but it also seems to catch
>>too much in the way of international characters.
>>
>>Back when I was testing with Arabic characters, it was essentially
>>generating file-ids with just the last portion (no filename part).
>>Now, maybe you feel that your unique identifier is sufficient (it could be).
> 
> 
> I thought it would consider all unicode word characters to match \w. 
> We should fix that; it seems reasonable to treat the id as unicode
> since filenames can be.
> 

Well, I just checked with this bit of code:

 >>> import re
 >>> x = u'Juju <\u062c\u0648\u062c\u0648>'
 >>> r = re.sub(u'[^\\w.]', '', x)
 >>> r
u'Juju'

The actual string is:
Juju <جوجو>

The catch is that \w only matches Unicode word characters if you supply
the re.UNICODE flag, as follows:
 >>> p = re.compile(r'[^\w.]', re.UNICODE)
 >>> r = p.sub('', x)
 >>> r
u'Juju\u062c\u0648\u062c\u0648'

That seems to do the trick.
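
For anyone trying this on Python 3, here is a minimal sketch of the same
check. It assumes Python 3 semantics, where str patterns match Unicode
word characters by default and re.ASCII brings back the narrower
behaviour shown in the first session above. The helper name sanitize and
its ascii_only parameter are made up for illustration; they are not bzr
functions.

```python
import re

def sanitize(name, ascii_only=False):
    """Strip everything that is not a word character or a dot.

    Hypothetical helper for illustration only. With ascii_only=True,
    \\w is limited to [a-zA-Z0-9_], which mimics matching without
    re.UNICODE in the sessions above.
    """
    flags = re.ASCII if ascii_only else 0
    return re.sub(r'[^\w.]', '', name, flags=flags)

x = 'Juju <\u062c\u0648\u062c\u0648>'

# ASCII-only matching drops the Arabic letters along with ' ', '<', '>'.
print(ascii(sanitize(x, ascii_only=True)))  # 'Juju'

# Unicode-aware matching keeps the Arabic letters.
print(ascii(sanitize(x)))  # 'Juju\u062c\u0648\u062c\u0648'
```

Either way, whitespace such as tabs and newlines is still removed, since
it is never a word character.
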
John
=:->

> --
> Martin
> 
