newformat format change
John A Meinel
john at arbash-meinel.com
Thu Oct 6 15:52:00 BST 2005
Martin Pool wrote:
> On 05/10/05, John A Meinel <john at arbash-meinel.com> wrote:
>
>
>>One problem with the current trapping (where you use a regular
>>expression to substitute everything that isn't a word character):
>>name = re.sub(r'[^\w.]', '', name)
>>
>>Which I believe will catch newlines and tabs. But it also seems to catch
>>too much in the way of international characters.
>>
>>Back when I was testing with Arabic characters, it was essentially
>>generating file-ids with just the last portion (no filename part).
>>Now, maybe you feel that your unique identifier is sufficient (it could be).
>
>
> I thought it would consider all unicode word characters to match \w.
> We should fix that; it seems reasonable to treat the id as unicode
> since filenames can be.
>
Well, I just checked with this bit of code:
>>> import re
>>> x = u'Juju <\u062c\u0648\u062c\u0648>'
>>> r = re.sub(u'[^\\w.]', '', x)
>>> r
u'Juju'
The actual string is:
Juju <جوجو>
The thing is that \w only matches unicode word characters if you supply
the re.UNICODE flag, as follows:
>>> p = re.compile(r'[^\w.]', re.UNICODE)
>>> r = p.sub('', x)
>>> r
u'Juju\u062c\u0648\u062c\u0648'
That seems to do the trick.
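For anyone trying this on a newer interpreter, here is a rough sketch of the same comparison. It assumes Python 3, where str patterns are already unicode-aware by default, so re.ASCII is used to reproduce the old behaviour shown above. The variable names are just for illustration and are not taken from the bzr code.

```python
import re

# Same test string as above: ASCII name plus Arabic letters in angle brackets.
name = 'Juju <\u062c\u0648\u062c\u0648>'

# re.ASCII limits \w to [a-zA-Z0-9_], which mimics the Python 2 default
# and strips the Arabic letters along with the space and brackets.
ascii_only = re.sub(r'[^\w.]', '', name, flags=re.ASCII)

# With re.UNICODE (implied for str patterns in Python 3), \w also matches
# non-ASCII letters, so only the space and brackets are removed.
pattern = re.compile(r'[^\w.]', re.UNICODE)
unicode_aware = pattern.sub('', name)

print(ascii(ascii_only))
print(ascii(unicode_aware))
```

Compiling the pattern once with the flag, rather than passing flags on every call, keeps the choice in one place if the substitution is used for every file-id.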
John
=:->
> --
> Martin
>