[win32] non-ascii/non-english file names: internal usage of file names

John A Meinel john at arbash-meinel.com
Wed Nov 30 22:31:36 GMT 2005


Jan Hudec wrote:
> On Wed, Nov 30, 2005 at 10:23:23 -0500, Aaron Bentley wrote:

...

> Converting filenames from local encoding to unicode is not a problem (as bzr
> can always refuse to work if it is not possible). But it IS a problem the
> other way round. Say someone on an iso-8859-2 system creates a file named 'kř'
> (k&rcaron; for those who can't display that character). And someone else on
> an iso-8859-1 system tries to check it out. Then bzr should not just throw up
> its hands and say it's not possible.

But what would you have it do? It has no way to legally represent the
file on the local system. I can think of possible workarounds, but what
would you recommend?
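
To make the failure concrete, here is a minimal Python sketch of the check
bzr would face at checkout time. The helper name can_represent is made up
for illustration and is not part of bzr:

```python
def can_represent(name, encoding):
    """Return True if the Unicode file name can be encoded in `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


name = "k\u0159"  # 'kř': k followed by r-with-caron

# The creator's iso-8859-2 system can store the name...
print(can_represent(name, "iso-8859-2"))
# ...but the iso-8859-1 system checking it out cannot.
print(can_represent(name, "iso-8859-1"))
```

The check itself is cheap. The open question is what to do when it fails.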

Also, we will run into this in other places. For example, Windows does not
allow many characters (", \, :, *, etc.) which are legitimate under unix.
So we need some sort of resolution. One option is to refuse to check out
a current version which has bogus names, but still allow a checkout when
the names were invalid only in older revisions.
Another option is to somehow munge the names. But if you munge the names,
how do you do it? Do you try to do it so that they are still considered
internally to have the old name? The naive implementation would have
them show up as a new file, with the old one deleted. A second
method would just have them automatically renamed.
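
As a rough sketch of the "munge but keep the old name internally" idea,
a reversible escape could look like the following. The function names and
the %XX scheme are my own assumptions, not anything bzr does today, and the
sketch ignores other Windows rules (reserved names like CON, trailing dots
and spaces):

```python
# Characters Windows rejects in file names. '%' is escaped as well so that
# the mapping stays reversible.
WINDOWS_RESERVED = '<>:"/\\|?*'


def munge_name(name):
    """Map an internal name to one that is safe to create on Windows."""
    out = []
    for ch in name:
        if ch in WINDOWS_RESERVED or ch == "%" or ord(ch) < 32:
            out.append("%%%02X" % ord(ch))  # e.g. ':' becomes '%3A'
        else:
            out.append(ch)
    return "".join(out)


def unmunge_name(munged):
    """Recover the internal name from the on-disk name."""
    out = []
    i = 0
    while i < len(munged):
        if munged[i] == "%":
            out.append(chr(int(munged[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(munged[i])
            i += 1
    return "".join(out)


on_disk = munge_name("a:b*c")
print(on_disk)
print(unmunge_name(on_disk))
```

Because the mapping is reversible, the tree could keep the original name in
its records and only the working copy would see the munged form, so the file
would not show up as a delete plus an add.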

I would imagine that there might be a valid near replacement for k&rcaron;.

But I'm also positive that I can write something in Arabic, which has no
replacement in latin-1. عرباش
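
A sketch of what a "near replacement" lookup might look like, assuming we
decompose the name and drop combining marks. near_replacement is a
hypothetical helper, and the result is lossy: 'kř' and 'kr' would collide.

```python
import unicodedata


def near_replacement(name, encoding):
    """Return a name encodable in `encoding`, or None if there is none.

    Tries the name as-is first, then the name with accents stripped.
    """
    try:
        name.encode(encoding)
        return name
    except UnicodeEncodeError:
        pass
    # NFKD splits 'ř' into 'r' plus a combining caron, which we then drop.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    try:
        stripped.encode(encoding)
    except UnicodeEncodeError:
        return None
    return stripped


print(near_replacement("k\u0159", "iso-8859-1"))  # accent can be stripped
print(near_replacement("عرباش", "iso-8859-1"))    # nothing to fall back to
```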

I believe on Windows (NT/XP) the real filename encoding is actually UTF-16,
so it shouldn't be a problem there.
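
A quick illustration, assuming names are handed to the OS as UTF-16: both
of the names above round-trip without loss. This only exercises the codec
and does not touch the filesystem.

```python
def roundtrips_utf16(name):
    """Return True if the name survives a trip through UTF-16 (little endian)."""
    encoded = name.encode("utf-16-le")
    return encoded.decode("utf-16-le") == name


for name in ["k\u0159", "عرباش"]:
    # Each of these characters takes one 16-bit code unit (2 bytes).
    print(len(name), len(name.encode("utf-16-le")), roundtrips_utf16(name))
```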

> 
>> And if people scream, we can go to a more complex approach of requiring
>> versioned files to be unicode, but not unversioned files in the tree.
>>
>> And if people scream, we can find ways to jam binary data into unicode,
>> in one of the user-defined sections.
> 
> Well, 'latin-1' can always be decoded to unicode, so that part is not too
> hard.
> 

Sure, but then you always have to decode it into something, which can
get really ugly.
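
One way it might get ugly, as a sketch: latin-1 does map every byte to some
code point, so decoding never fails, but the result is not necessarily the
name the user typed.

```python
# Every byte value 0..255 decodes under latin-1, and the mapping round-trips.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
assert text.encode("latin-1") == all_bytes

# But bytes written on an iso-8859-2 system come back as different letters.
raw = "k\u0159".encode("iso-8859-2")  # the on-disk bytes for 'kř'
shown = raw.decode("latin-1")
print(shown)              # no longer the name the author typed
print(shown == "k\u0159")
```

So the bytes are preserved, but anything that displays or compares the
decoded name sees a different string.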

John
=:->

