[BUG] bzr changeset generation fails with non-ascii characters

Sat Jul 16 06:42:13 BST 2005

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15 Jul 2005, at 13:50, Aaron Bentley wrote:

> Hi all,
>
> My python installation thinks 'ascii' is a good character encoding, and
> who am I to argue?  This means that William Dodé is my constant nemesis,
> because wherever his distinctly non-ascii name appears, trouble is sure
> to follow.
>
> In this case, I get an error with bzr changeset (full traceback below).
> Essentially, it says that bzrlib.diff.internal_diff can't convert
> 0xc3 (acute e) to ASCII.  That sounds fair enough, but what may not be
> obvious here is that it shouldn't need to.  internal_diff should be
> operating in a binary/8-bit fashion on all sequence data; otherwise,
> you can get lossy character conversions, or errors because certain
> Unicode code points are undefined.  Bzr isn't interested in these files
> as text; it's their byte streams that matter.
>
> So we need to figure out what is provoking unicode handling of this
> data, and get it to use an 8-bit, encoding-ignorant approach instead.

The problem of course is line 103 of changeset/__init__.py:

         outf = codecs.getwriter(user_encoding)(sys.stdout,
                 errors='replace')

FWIW, I agree that the changeset should be treated as being in no
encoding (using whatever encoding is used for each file), and that means
being 8-bit clean with no codec.  In my Python (2.3.5) it appears that
you can write 8-bit clean data to sys.stdout even in US-ASCII mode
(LC_ALL=C), so I bet just removing the codec line above will fix it.
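
To make the failure mode concrete, here is a minimal sketch written for
modern Python 3, not the 2.3-era bzr code.  The sample diff line and the
helper names write_via_codec and write_raw are made up for illustration;
only codecs.getwriter(...)(stream, errors='replace') comes from the line
quoted above.  Python 2 decoded a byte string implicitly when it reached
the codec writer, so the sketch performs that decode explicitly.  In the
sketch, errors='replace' only affects the encode step, so it does nothing
for the decode that actually fails.

```python
import codecs
import io

# Hypothetical changeset line: the UTF-8 bytes of a non-ASCII name.
diff_line = "+ William Dod\u00e9\n".encode("utf-8")


def write_via_codec(data, encoding="ascii"):
    """Push raw bytes through a codec writer, as the quoted line does.

    Returns (written_bytes, None) on success or (None, error) on failure.
    """
    sink = io.BytesIO()
    writer = codecs.getwriter(encoding)(sink, errors="replace")
    try:
        # Assumption: this explicit decode stands in for the implicit
        # byte-string-to-unicode conversion Python 2 did on write.
        writer.write(data.decode(encoding))
    except UnicodeDecodeError as exc:
        return None, exc
    return sink.getvalue(), None


def write_raw(data):
    """Write the bytes untouched: the 8-bit clean, no-codec approach."""
    sink = io.BytesIO()
    sink.write(data)
    return sink.getvalue()


if __name__ == "__main__":
    written, error = write_via_codec(diff_line)
    print("codec path error:", error)
    print("raw path output: ", write_raw(diff_line))
```

The raw path hands back exactly the bytes it was given, which is what a
changeset needs if each file keeps its own encoding.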

robey

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (Darwin)

iD8DBQFC2J45QQDkKvyJ6cMRAkFOAKDErZY9x1mhOz7c6Q8GbZSJyWPhCgCggiP7
SaTIpeqwvtvUG5FAT5KKIgc=
=eNot
-----END PGP SIGNATURE-----