[MERGE] Make annotate behave in a non-ASCII world

Wed Jul 11 02:41:22 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 7/7/07, Adeodato Simó <dato at net.com.org.es> wrote:
>> * Aaron Bentley [Fri, 06 Jul 2007 13:37:39 -0400]:
>>
>> > Adeodato Simó wrote:
>>
>> > > I think this, or some other solution, is a must have, even if
>> > > everybody prefers gannotate these days. ;-)
>>
>> > > +        try:
>> > > +            to_file.write(anno)
>> > > +        except UnicodeEncodeError:
>> > > +            to_file.write(anno.encode(to_file.encoding, 'replace'))
>>
>> > Could you say why you're trying and catching the exception here?  I
>> > think it would be better to encode the string unconditionally.
>>
>> Because we want the 'replace' encoding to happen *only* if the to_file
>> object can't handle the characters in anno (which happens if, for example,
>> a user with LANG=C annotates a file where one committer had non-ASCII
>> characters).
>>
>> Unconditionally encoding is not desirable (as John also pointed out to
>> me the first time) because users with the appropriate $LANG would see an
>> unnecessarily mutilated string (which in the case of non-latin scripts
>> would be the whole string).
>>
>> Hope the explanation was clear.
> 
> Sorry, I still don't understand it.  I think that writing a unicode
> string to a file is the same as encoding it in the file's encoding,
> then writing that byte string?  If all the unicode characters are
> representable in that encoding, then the first attempt will succeed.
> If any of them are not representable then it will fail and we'll redo
> it and replace those characters.  How is this different to just
> passing errors=replace in the first place?
> 

sys.stdout defaults to errors='strict', and if we wrap it in a codec wrapper,
then the wrapper will try to decode the plain strings we write. That is very
bad: the file contents we write are in whatever encoding the file happens to
use, so the decode fails as soon as there are any non-ASCII characters.
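
As a rough illustration of the errors='strict' default only (not of the codec
wrapper problem), here is a minimal sketch in current Python 3, not the
Python 2 that bzr targeted at the time. The ASCII stream standing in for
sys.stdout under LANG=C and the helper try_write are assumptions made for the
example, not bzr code.

```python
import io

# An ASCII text stream standing in for sys.stdout under LANG=C (assumption).
# No errors= argument is given, so the stream uses the default, 'strict'.
stream = io.TextIOWrapper(io.BytesIO(), encoding='ascii')


def try_write(text):
    """Return True if the stream accepted the text, False on an encode error."""
    try:
        stream.write(text)
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = try_write('Martin Pool\n')              # plain ASCII is accepted
non_ascii_ok = try_write('Adeodato Sim\u00f3\n')   # 'ó' cannot be encoded
```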

So the problem is that we need to write some things in sys.stdout.encoding
(with errors='replace') and some things as 8-bit strings.

Also, we may not be writing to sys.stdout, since to_file could be anything (it
just defaults to sys.stdout for 'bzr annotate').

Anyway, we could just unconditionally encode; it just seemed better to do it as
a fallback.
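
To compare the two options at the byte level, here is a minimal sketch in
current Python 3. The function names encode_with_fallback and
encode_always_replace are hypothetical, and the sketch looks only at the
encoding step, not at how bzr writes to to_file.

```python
def encode_with_fallback(text, encoding):
    """Encode strictly first; use 'replace' only if the strict attempt fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return text.encode(encoding, 'replace')


def encode_always_replace(text, encoding):
    """Encode with errors='replace' unconditionally."""
    return text.encode(encoding, 'replace')


name = 'Adeodato Sim\u00f3'

# Both strategies give the same bytes whether or not the text is representable.
same_ascii = encode_with_fallback(name, 'ascii') == encode_always_replace(name, 'ascii')
same_utf8 = encode_with_fallback(name, 'utf-8') == encode_always_replace(name, 'utf-8')
```

Under these assumptions the two strategies agree for the encode step itself:
with 'ascii' both yield b'Adeodato Sim?', and with 'utf-8' both yield the
unmodified UTF-8 bytes. Any difference in practice comes from how the stream
handles what it is given, not from the encoding call.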

Also, I thought I asked to have a "to_file.encoding or 'ascii'" since
to_file.encoding can be None.
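
A minimal sketch of that guard, again in current Python 3. The helper
annotation_bytes and the stand-in class NoEncodingStream are hypothetical, and
the getattr() call is an extra assumption to cover file-like objects that have
no encoding attribute at all.

```python
def annotation_bytes(anno, to_file):
    """Encode anno for to_file, falling back to ASCII if no encoding is set."""
    # to_file.encoding can be None, so "or 'ascii'" supplies a safe default.
    encoding = getattr(to_file, 'encoding', None) or 'ascii'
    return anno.encode(encoding, 'replace')


class NoEncodingStream(object):
    """Hypothetical stand-in for a to_file whose encoding is None."""
    encoding = None


class Utf8Stream(object):
    """Hypothetical stand-in for a to_file that reports UTF-8."""
    encoding = 'utf-8'


fallback = annotation_bytes('Sim\u00f3', NoEncodingStream())   # b'Sim?'
utf8 = annotation_bytes('Sim\u00f3', Utf8Stream())             # b'Sim\xc3\xb3'
```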

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGlDUzJdeBCYSNAAMRApQPAJ9PB3yW2fQKN8CPdNtymBSbbkw8cACfT1TL
buQkX0Yt6mW8NLLXWYUKcjI=
=r9e+
-----END PGP SIGNATURE-----