i18n and file systems

Wed Dec 14 20:46:26 GMT 2005

Robert Collins wrote:
[skip]
> But it raises an interesting discussion we've kind of ignored. Firstly
> the background:
> 
> Some file systems/platforms are unicode through and through - no matter
> what your terminal encoding is, the file system can still represent and
> return a unicode path. (Whether python figures this out and uses the
> appropriate APIs is a good question). Examples are NTFS (on win32)
> (IIRC), and HFS+ (with MacOSX). Let's call this unicode safe.

I think FAT32 is also unicode safe, because it stores long filenames
as Unicode (UCS-2).

[skip]

> Now for the interesting bits :).
> 
> Firstly, I think we should be aiming to ensure that *no matter what*,
> files that bzr creates are named such that all such environments
> described above pun the filename as having the same value. That's
> essentially 7 bit ascii (the places this breaks are sufficiently far
> between IME that we can ignore them).
> 
> At the moment we *may* do that but we should go further: 
>  * We should write tests that check that regardless of revision-id
> value, or file-id value, the stores do not request non-ascii characters
> of paths from the transport layer. (Volunteers sought!) This involves
> teaching the stores to escape for the transport as part of the
> id->filename mapping *before* the url encoding is put on. 

I don't completely understand your idea.

By id, do you mean the file id, which at present is the file name of
the weave files in the .bzr directory? How should this mapping be done?
What rule should be used to transform the id? It should be a
transport-independent solution, shouldn't it?

Would this mapping be hardcoded as a set of rules, or would it be
created and grown from revision to revision and stored in some control
file that lives in .bzr? How do we avoid possible collisions between
mapped filenames?

In any case, it is a good point.
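
To make the questions above concrete, here is a minimal sketch of the
"hardcoded set of rules" option, written for Python 3. It is only an
illustration: the names escape_id, unescape_id and SAFE_CHARS are
hypothetical and are not part of bzr. The rule does not depend on any
transport, and because every byte outside the safe set (including '%'
itself) is escaped, two different ids can never map to the same file
name.

```python
import string

# Hypothetical rule: only these characters are written to disk unchanged.
# Upper-case letters are escaped too, so two ids that differ only in case
# cannot collide on a case-insensitive file system.
SAFE_CHARS = frozenset(string.ascii_lowercase + string.digits + "-_@")


def escape_id(id_text):
    """Map a unicode id to a 7-bit ASCII file name, reversibly."""
    parts = []
    for byte in id_text.encode("utf-8"):
        ch = chr(byte)
        # Every byte outside the safe set, including '%' itself, becomes %xx.
        parts.append(ch if ch in SAFE_CHARS else "%%%02x" % byte)
    return "".join(parts)


def unescape_id(name):
    """Invert escape_id: turn each %xx back into a byte, then decode UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "%":
            raw.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(name[i]))
            i += 1
    return raw.decode("utf-8")
```

A rule like this needs no control file in .bzr, at the cost of longer
and less readable file names for ids that contain many non-ascii
characters.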

> That means that no matter where it is, a .bzr dir and its contents will
> look the same to us, so we are insulated from the coding effects.

[skip]

> Alexander - I explicitly copied you because I think you probably have
> the most complex setup of a bzr contributor at the moment, and are ideal
> to provide input/testing into this.

I would like to draw your attention to another, similar problem:
revision naming (the revision id). At the moment it is possible for a
revision to be assigned a non-ascii revision id. For example, when bzr
auto-creates the user's email, the user name (the system login on
Windows, for example) may have been entered in a non-English language
(Windows allows this, and I know that Russian Windows users often use
this feature). In this case the revision id is not ascii safe. So the
revision id should also be converted to an ascii-safe (7-bit) form.
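
The kind of test Robert asks for would catch this case. The sketch
below is only an illustration under stated assumptions:
RecordingTransport and naive_store_add are hypothetical stand-ins, not
bzr classes, and the revision id is made up to resemble one built from
a Russian login name.

```python
class RecordingTransport:
    """Hypothetical stand-in for a transport; it only records paths."""

    def __init__(self):
        self.paths = []

    def put(self, relpath, data):
        self.paths.append(relpath)


def non_ascii_paths(paths):
    """Return the recorded paths that are not pure 7-bit ASCII."""
    return [p for p in paths if any(ord(ch) > 127 for ch in p)]


def naive_store_add(transport, revision_id, data):
    """A store that uses the id directly as the file name (the problem case)."""
    transport.put(revision_id, data)


# Made-up revision id; the first four characters spell "Ivan" in Cyrillic.
revision_id = u"\u0418\u0432\u0430\u043d@host-20051214204626-0123abcd"

transport = RecordingTransport()
naive_store_add(transport, revision_id, b"inventory")

# A real test would assert that this list is empty.
offending = non_ascii_paths(transport.paths)
```

With the naive store the list is not empty, which is exactly the
failure such a test should report until the store escapes its ids.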

Right now, on Windows, most of the problems with non-ascii filenames
can be solved by always using the unicode Python functions (the unicode
form of a function, or passing unicode file paths as arguments). So I
can write tests, but I am not able to test on Linux with
`unicode-sometimes' filesystems, because the Windows filesystem seems
to be unicode safe at its core.
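
As a small illustration of "passing unicode file paths as arguments":
os.listdir returns unicode names when it is given a unicode path. The
helper name list_unicode is hypothetical, and the sketch assumes that
decoding a byte path with the file system encoding is acceptable.

```python
import os
import shutil
import sys
import tempfile

TEXT_TYPE = type(u"")  # the unicode string type


def list_unicode(directory):
    """List a directory, always asking the OS with a unicode path."""
    if not isinstance(directory, TEXT_TYPE):
        # A byte path was given: decode it so the unicode API is used.
        directory = directory.decode(sys.getfilesystemencoding())
    return os.listdir(directory)


# Small demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
try:
    with open(os.path.join(demo_dir, "readme.txt"), "w") as f:
        f.write("hello")
    names = list_unicode(demo_dir)
finally:
    shutil.rmtree(demo_dir)
```

On a `unicode-sometimes' file system this is not enough by itself:
Python 2 documents that undecodable file names are still returned as
byte strings, so tests on such a platform are still needed.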

Alexander