[MERGE][BUG #185458] Switch NFKC=>NFC for normalization checks

John Arbash Meinel john at arbash-meinel.com
Mon Jan 28 20:17:14 GMT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2008-01-28 at 11:09 -0600, John Arbash Meinel wrote:
>> The bug itself has more detail, but basically NFC is a better 
>> normalization to use. This changes the internals to use it, and adds a 
>> couple small test updates.
> 
> Doesn't this require a format bump?
> 
> -Rob

I don't think so.

At the moment all we do (effectively) is:

if name != normalize('NFKC', name):
  raise IllegalName()

We don't actually translate the names anymore. (We haven't since
WT4/dirstate.)

NFC does less than NFKC (it applies only the canonical mappings and
skips the compatibility ones), so it should only allow more things than
were allowed before.
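
As a rough illustration of the difference, here is a minimal sketch of
the check under both forms. It is written for Python 3, and
passes_check is a hypothetical helper name, not the actual bzrlib
function.

```python
import unicodedata

def passes_check(name, form):
    """Hypothetical helper: True if `name` is unchanged by `form`."""
    return name == unicodedata.normalize(form, name)

# MICRO SIGN, VULGAR FRACTION ONE QUARTER, and a decomposed e + acute.
samples = ('\xb5', '\xbc', 'e\u0301')
for ch in samples:
    # Columns: the character, accepted under NFKC?, accepted under NFC?
    print(ascii(ch), passes_check(ch, 'NFKC'), passes_check(ch, 'NFC'))
```

The first two characters have only compatibility mappings, so the NFKC
check rejects them while the NFC check accepts them. The decomposed
sequence is rejected under either form.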

In other words, I'm pretty sure that:

  normalize('NFC', normalize('NFKC', x)) == normalize('NFKC', x)

I tested it with this loop:

>>> import unicodedata
>>> for i in xrange(100000):
...   x = unichr(i)
...   y = unicodedata.normalize('NFC', x)
...   z = unicodedata.normalize('NFKC', x)
...   if y == z: continue  # both forms agree, nothing to check
...   a = unicodedata.normalize('NFC', z)
...   if a != z:  # NFC changed an already NFKC-normalized string
...     print i, x, y, z, a
...

And found 0 hits.

I wasn't sure how high to go, but since the characters we are having
problems with are things like µ (u'\xb5') and ¼ (u'\xbc'), I figured
100,000 would be enough to find any other oddities.
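
For anyone who wants to remove the guesswork about the upper bound, the
same check can be run over every code point. This is only a sketch,
written for Python 3 rather than the Python 2 used above, and
find_counterexamples is a hypothetical name. It tests single code
points only, not multi-character sequences.

```python
import sys
import unicodedata

def find_counterexamples(limit=sys.maxunicode + 1):
    """Collect code points whose NFKC form is changed again by NFC."""
    hits = []
    for i in range(limit):
        if 0xD800 <= i <= 0xDFFF:
            continue  # skip surrogates, they are not real characters
        x = chr(i)
        z = unicodedata.normalize('NFKC', x)
        if unicodedata.normalize('NFC', z) != z:
            hits.append((i, x, z))
    return hits

hits = find_counterexamples()
print(len(hits), "code points where NFC changes the NFKC form")
```

The Unicode normalization spec says that any string in NFKC is also in
NFC, so an empty result here is what one would expect.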

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHnjhKJdeBCYSNAAMRAhxDAJ0QkkekWQBp6r71kwmnKUhkRG40CACeO84y
OqBSOTmpRaMI41FR05xjuuE=
=QBzg
-----END PGP SIGNATURE-----


