[BUG] Unicode string must be always used with encodings
John A Meinel
john at arbash-meinel.com
Tue Sep 27 21:53:29 BST 2005
Alexander Belchenko wrote:
> John A Meinel wrote:
>
>>
>> So back to Alexander's comment. What is the actual error when you use
>> bzr to list those directories? So that we can track it down, and make
>> sure it works.
>
>
> It *almost* works, and the latest 0.0.9 no longer fails, but it works
> with a bug:
Well, on Mac, I'm getting really weird errors. Apparently, the terminal
knows how to handle utf-8 encoded names, so I can do:
$ bzr init
$ bzr status
unknown:
جوجو.txa
$ bzr add
added "جوجو.txa"
$ bzr status
added:
bzr: ERROR: 'ascii' codec can't encode characters in position 0-7:
ordinal not in range(128)
at /Users/jameinel/dev/bzr/bzr.dev/bzrlib/delta.py line 98, in show_list()
see ~/.bzr.log for debug information
Now, if I just check locale.getpreferredencoding() it says "mac-roman".
But if I try to print a unicode string, it tells me the same "ascii
codec can't encode characters" stuff.
I *did* find that by using "export LANG='en_US.utf-8'" I was able to
make Python think that the sys.stdout encoding was utf-8, rather than
ascii. After that, sys.stdout reports 'utf-8' as its encoding, and
things work. Though locale.getpreferredencoding() still returns the
bogus "mac-roman".
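For reference, here is the kind of diagnostic I mean (a sketch in modern-Python terms; both names are standard library, though the values you see will depend on your platform and LANG):

```python
import locale
import sys

# Compare the two places Python derives an output encoding from.
# On my Mac the first reports "mac-roman" while stdout still uses ascii.
preferred = locale.getpreferredencoding()
stdout_enc = sys.stdout.encoding

print("locale.getpreferredencoding():", preferred)
print("sys.stdout.encoding:", stdout_enc)
```

When the two disagree, anything that trusts one while the terminal uses the other will blow up exactly the way shown above.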
What I'm seeing is that the original bzr status treats the file as
just a blob of characters (which happen to be utf-8 encoded), and the
initial add does the same.
However, these characters are then sent to ElementTree, and when they
are read back in, they come back as Unicode strings, not utf-8 byte
strings.
So the attached patch changes different locations that recurse through
the filesystem so that they traverse in Unicode mode, rather than in
regular string mode.
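The principle behind the patch can be sketched like this (in modern-Python terms: in the Python of 2005 the split was str path vs unicode path, today it is bytes vs str, but the mechanism is the same):

```python
import os
import tempfile

# The type of the path you pass in decides the type of the names you
# get back out -- so traversing "in Unicode mode" just means starting
# the recursion from a text path instead of a byte path.
d = tempfile.mkdtemp()
open(os.path.join(d, '\u062c\u0648\u062c\u0648.txa'), 'w').close()

text_names = os.listdir(d)               # text path  -> decoded text names
byte_names = os.listdir(os.fsencode(d))  # byte path  -> raw byte names
```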
After applying the attached patch to use unicode searching, and setting
my LANG appropriately, bzr works properly on my machine.
There is still a problem if you try to commit without a commit message,
because it creates a StringIO and grabs the output of show_status(). The
problem is that cStringIO.StringIO() only handles ascii byte strings;
it blows up on non-ascii unicode. The good news is that
StringIO.StringIO() seems capable of handling unicode.
For example, try this:
import cStringIO, StringIO
c = cStringIO.StringIO()
c.write('test')
c.write(u' this')
c.getvalue()
Notice that the returned value is a standard string. But if you do:
s = StringIO.StringIO()
s.write('test')
s.write(u' this')
s.getvalue()
The return value is a Unicode string.
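In modern Python both modules are gone, but the same byte/text split survives in the io module, and the trap is easier to see there because the byte buffer refuses text outright instead of silently coercing (a sketch, not the 2005 behaviour):

```python
import io

# io.BytesIO is the byte-only buffer (like cStringIO with non-ascii
# data, it won't take unicode); io.StringIO is the text-only one.
b = io.BytesIO()
b.write(b'test')
refused = False
try:
    b.write(u' this')        # text into a byte buffer
except TypeError:
    refused = True           # refused outright

s = io.StringIO()
s.write('test')
s.write(u' this')
result = s.getvalue()        # a text (unicode) string: 'test this'
```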
Now, we still have some problems when writing to files, so we might need
to fall back to using "codecs.open()".
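The codecs.open() fallback would look something like this (a sketch; the file name is made up, but codecs.open itself is standard library and the object it returns encodes on write and decodes on read, independent of the locale):

```python
import codecs
import os
import tempfile

# Round-trip a unicode line through a utf-8 encoded file, without
# relying on sys.stdout or the locale getting the encoding right.
path = os.path.join(tempfile.mkdtemp(), 'status.txt')

f = codecs.open(path, 'w', encoding='utf-8')
f.write(u'\u062c\u0648\u062c\u0648.txa\n')   # the filename from above
f.close()

f = codecs.open(path, 'r', encoding='utf-8')
line = f.read()
f.close()
```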
I'm also getting weird failures when trying to handle unicode read from
files. If I have a line of unicode from a file, and try to do:
print line
It works fine, but if I do "sys.stdout.write(line)" it complains about
not being able to translate the unicode characters into ascii.
However, shouldn't the diff libraries be treating the strings as ascii
strings, and not unicode anyway?
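One workaround I can imagine, sketched in modern-Python terms where the raw byte stream is explicit (io.BytesIO stands in for a byte-oriented stdout; this is an illustration, not what bzr does):

```python
import io

# If the underlying stream only takes bytes, encoding explicitly
# before the write sidesteps whatever default codec Python picked.
line = u'\u062c\u0648\u062c\u0648.txa\n'

out = io.BytesIO()               # stand-in for a raw byte stdout
out.write(line.encode('utf-8'))  # encode first, then write bytes
data = out.getvalue()
```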
John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unicode-filenames.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20050927/39ce7165/attachment.diff