Unicode through filesystem tricks

John A Meinel john at arbash-meinel.com
Sat Jan 14 20:30:17 GMT 2006


Aaron Bentley wrote:
> John A Meinel wrote:
>>> But does that mean that now anytime we read from the user, or read from
>>> the filesystem we need to do:
>>>
>>> s = unicodedata.normalize('????', s.decode(bzrlib.user_encoding))
> 
> I think we can do this on a case-by-case basis.  Technically, though,
> any data generated on a different system may have a different encoding.
> 
>>> That may be the sanest way. Or maybe we would only have to do it on
>>> I'm still trying to understand it. So far, it seems like 'canonical'
>>> means that they are exactly the same character.
> 
> Canonical means the sole sanctioned representation.  The canonical
> composed representation for a-with-acute has the same numerical value as
> the iso-8859-1 character.  The canonical decomposed representation, I
> assume, has 'a' followed by the 'acute' combining character.
> 
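For instance, Python's unicodedata module shows both canonical forms of
a-with-acute directly:

```python
import unicodedata

# NFC (canonical composed): a single code point, U+00E1, numerically the
# same as the iso-8859-1 a-with-acute.
print(unicodedata.normalize('NFC', 'a\u0301') == '\u00e1')   # True

# NFD (canonical decomposed): 'a' followed by the combining acute, U+0301.
print(unicodedata.normalize('NFD', '\u00e1') == 'a\u0301')   # True
```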
>>> So how do we want to represent unicode strings inside bzr? It seems they
>>> should be normalized, but which form?
> 
> NFC is the most compatible.
> 
>>> So they sound less efficient in CPU cycles,
>>> though they end up being shorter in physical bytes.
> 
> I doubt the spec requires them to actually do the decomposition, as long
> as the effect is as though they had.
> 
>>> My first preference would be to use NFKC, since those would end up being
>>> more compact, 
> 
> Also, because of its similarity to iso-8859, more compatible.
> 
>>> Any thoughts?
> 
> If some filesystems are doing normalization, we must ensure that
> normalization is always performed, because normalized filesystems have
> fewer possible names.
> 
> Aaron
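The "fewer possible names" point is easy to demonstrate: two distinct
Python strings collapse to a single name on a filesystem that normalizes
(HFS+ stores names in the decomposed form):

```python
import unicodedata

nfc = '\u00e1'      # precomposed a-acute
nfd = 'a\u0301'     # 'a' + combining acute

print(nfc == nfd)                                  # False: distinct strings
print(unicodedata.normalize('NFD', nfc) == nfd)    # True: one name on HFS+
```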

So, to make things easy, I would like bzr to support only canonicalized
names internally. You couldn't 'bzr add' a non-canonical name under
Windows or Linux (where such names can exist), and names would be
translated appropriately under Mac (where the form we don't prefer
exists). I suppose we should keep support for any form in '.bzrignore',
since you might have a non-canonical name in a Linux tree that you just
want ignored (generated automatically by some tool, for instance).
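A hypothetical sketch of how ignore matching could honor both forms
(exact names rather than globs, for brevity; none of these names are
real bzrlib API):

```python
import unicodedata

def is_ignored(path, ignore_names):
    # Compare both the raw on-disk name and its normalized form, so a
    # non-canonical name in the tree still matches a .bzrignore entry.
    forms = {path, unicodedata.normalize('NFKC', path)}
    return any(form in ignore_names for form in forms)
```

With this, is_ignored('a\u0301.log', {'\u00e1.log'}) matches even though
the on-disk name is in the decomposed form.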

We have already made the statement that we only support valid unicode
names inside source trees, which I think is reasonable. And it isn't
much harder to request normalized valid unicode names.

I'm going to need to do something in my encoding branch, just to get the
test suite to pass on Mac OSX (my laptop). So in one sense, I can make
this the 'de facto' statement, since I'm the one doing the unicode
compatibility right now.

I see the next few steps as ...

1) Fix up and audit the Transport objects, to make sure they actually
present a URL interface. (They always return URL strings, and only
accept URLs).
2) Create an 'osutils.unicode_filename' function. On Mac, this would
normalize the filenames. On Linux and Win32, it would check that the
filenames are already normalized. Something like:

import sys
import unicodedata

if sys.platform == 'darwin':
    def unicode_filename(path):
        # Mac: normalize the name ourselves; it is always valid afterwards.
        return unicodedata.normalize('NFKC', path), True
else:
    def unicode_filename(path):
        # Linux/Win32: leave the name alone, but report whether it was
        # already normalized.
        return path, unicodedata.normalize('NFKC', path) == path

3) Fix WorkingTree so that it knows what to do when unicode_filename
reports that a path isn't normalized ('add' should fail with an
exception, 'unknowns' should not warn, 'ignored' should do the right
thing, etc).
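Step 3 might look something like this sketch (the exception name and
helper are assumptions, not current bzrlib code):

```python
import unicodedata

class NonNormalizedPath(Exception):
    """Hypothetical error: 'add' was given a path in non-canonical form."""

def check_add_path(path):
    # On Linux/Win32 the non-canonical form can exist on disk, so 'add'
    # refuses it outright; 'unknowns' would just list it without warning.
    if unicodedata.normalize('NFKC', path) != path:
        raise NonNormalizedPath(path)
    return path
```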

How does this sound?

John
=:->



