[MERGE][BUG #185458] Switch NFKC=>NFC for normalization checks
John Arbash Meinel
john at arbash-meinel.com
Mon Jan 28 20:17:14 GMT 2008
Robert Collins wrote:
> On Mon, 2008-01-28 at 11:09 -0600, John Arbash Meinel wrote:
>> The bug itself has more detail, but basically NFC is a better
>> normalization to use. This changes the internals to use it, and adds a
>> couple small test updates.
>
> Doesn't this require a format bump?
>
> -Rob
I don't think so.
At the moment all we do (effectively) is:
  if name != normalize('NFKC', name):
      raise IllegalName()
We don't actually translate the names anymore. (We haven't since
WT4/dirstate.)
NFC does less than NFKC (it skips the compatibility mappings), so the
check should only allow more names than were allowed before.
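As a rough illustration of that difference, here is a small Python 3 sketch of the two checks side by side. The helper name is_normalized is made up for this example and is not the bzrlib API; it only mirrors the comparison shown above.

```python
import unicodedata


def is_normalized(form, name):
    """Hypothetical helper: True if `name` is unchanged by `form`."""
    return name == unicodedata.normalize(form, name)


# MICRO SIGN and VULGAR FRACTION ONE QUARTER carry compatibility mappings,
# so they fail the NFKC check but pass the NFC one.
for name in ('\xb5', '\xbc'):
    print(ascii(name), is_normalized('NFKC', name), is_normalized('NFC', name))

# 'a' followed by COMBINING RING ABOVE is not composed, so both checks
# still reject it.
decomposed = 'a\u030a'
print(ascii(decomposed),
      is_normalized('NFKC', decomposed),
      is_normalized('NFC', decomposed))
```

Names that are merely uncomposed are rejected either way; only the compatibility characters change status.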
In other words, I'm pretty sure that:
normalize('NFC', normalize('NFKC', x)) == normalize('NFKC', x)
I tested it with this loop:
>>> import unicodedata
>>> for i in xrange(100000):
...     x = unichr(i)
...     y = unicodedata.normalize('NFC', x)
...     z = unicodedata.normalize('NFKC', x)
...     if y == z: continue
...     a = unicodedata.normalize('NFC', z)
...     if a != z:
...         print i, x, y, z, a
...
And found 0 hits.
I wasn't sure how high to go, but since the characters we are having
problems with are things like µ (u'\xb5') and ¼ (u'\xbc'),
I figured 100,000 would be enough to find any other oddities.
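For anyone who wants to repeat the check over the whole code point range, a minimal sketch follows, assuming Python 3 (chr and range instead of unichr and xrange). The function name find_counterexamples is invented for this example. Like the loop above, it only tests single code points, not multi-character sequences.

```python
import unicodedata


def find_counterexamples(limit=0x110000):
    """Return code points x where NFC(NFKC(x)) differs from NFKC(x)."""
    hits = []
    for i in range(limit):
        if 0xD800 <= i <= 0xDFFF:
            continue  # skip surrogates, which are not Unicode scalar values
        x = chr(i)
        z = unicodedata.normalize('NFKC', x)
        # If NFKC output is already NFC-stable, this never fires.
        if unicodedata.normalize('NFC', z) != z:
            hits.append(i)
    return hits


hits = find_counterexamples()
print(len(hits))
```

An empty result only means no single code point breaks the property under the Unicode data shipped with that Python build.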
John
=:->