[BUG] bzr changeset generation fails with non-ascii characters
John A Meinel
john at arbash-meinel.com
Sat Jul 16 16:11:18 BST 2005
Robey Pointer wrote:
...
>>> So we need to figure out what is provoking unicode handling of this
>>> data, and get it to use an 8-bit, encoding-ignorant approach instead.
>
>
> The problem of course is line 103 of changeset/__init__.py:
>
> outf = codecs.getwriter(user_encoding)(sys.stdout,
> errors='replace')
>
> FWIW, I agree that the cset should be treated as being in no encoding
> (using whatever encoding is used for each file), and that means being
> 8-bit clean with no codec. In my python (2.3.5) it appears that you
> can write 8-bit clean data to sys.stdout even in US-ASCII mode
> (LC_ALL=C) so I bet just removing that codec line above will fix it.
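For reference, here is a minimal sketch of what that codec wrapper does to text it cannot encode. It is written for Python 3 rather than the 2.3.5 mentioned above, and it uses an in-memory buffer and a hard-coded "ascii" encoding as stand-ins for sys.stdout and user_encoding; both are assumptions for illustration, not the actual changeset code.

```python
import codecs
import io

# Stand-in for the byte stream underneath sys.stdout (assumption).
raw = io.BytesIO()

# Same shape as the quoted line, with "ascii" playing the role of
# user_encoding under LC_ALL=C.
outf = codecs.getwriter("ascii")(raw, errors="replace")

# Text that does not fit the target encoding is replaced, not preserved.
outf.write("committer: William Dod\u00e9\n")
lossy = raw.getvalue()
print(lossy)
```

The e-acute comes out as "?", so anything routed through this writer is at best lossy for non-ascii input.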
Well, there are 2 pieces to a changeset. The reason I used a unicode
encoding is that the meta-information (commit messages, user names,
etc.) is all written in unicode (stored as utf-8).
I agree that the patches, etc, should all be considered untranslated
8-bit blobs. I will make a change, but we should determine how we want
to handle the meta information.
I'm thinking that probably we can just standardize on "meta-information
is utf-8 encoded", and "patches are untranslated".
Does that seem reasonable? The current method would try to translate
meta information into the user's local preferred encoding, but since it
is a format that is meant to be given to someone else, it seems that
utf-8 encoding might be best.
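A rough sketch of the split proposed above: meta-information goes out as utf-8, patch bodies go out as untouched bytes. This is Python 3, and write_changeset, the "# " prefix handling, and the sample data are all hypothetical names made up for illustration; it is not the code in changeset/__init__.py.

```python
import io

def write_changeset(out, meta_lines, patch_chunks):
    """Write text meta-information as utf-8 and patches as raw bytes.

    out          -- a binary stream (hypothetical stand-in for stdout)
    meta_lines   -- iterable of str, e.g. "committer: ..."
    patch_chunks -- iterable of bytes, in whatever encoding each file uses
    """
    for line in meta_lines:
        # Meta-information: always utf-8, regardless of the user's locale.
        out.write(("# " + line + "\n").encode("utf-8"))
    for chunk in patch_chunks:
        # Patches: 8-bit clean, no codec involved.
        out.write(chunk)

buf = io.BytesIO()
write_changeset(
    buf,
    ["committer: William Dod\u00e9"],
    [b"+caf\xe9\n"],  # a latin-1 encoded patch line, passed through as-is
)
data = buf.getvalue()
print(data)
```

The header ends up with the two-byte utf-8 form of the e-acute, while the patch line keeps its original single byte.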
Of course, there are 2 sections of meta-information.
There is the text in # committer:
And there are the entries in
*** added file myfile.txt
Right now for:
*** added file
We use unicode escaping. So non-ascii characters will get written out as
"\u2233".
I don't think we want to use this encoding for the headers, otherwise
you would get:
# committer: William Dod\xe9
I think this is okay for the filename meta-information, so you get:
*** modified file myfil\xe9.txt
Or would we prefer to change the meta-information lines also to utf-8?
The more I think about it, the more I'm okay with doing all meta
information as utf-8.
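To make the two options concrete, here is a small Python 3 sketch comparing unicode escaping with utf-8 for a filename. It uses the standard "unicode_escape" codec as an approximation of the escaping described above; whether the changeset code uses exactly that codec is an assumption.

```python
name = "myfil\u00e9.txt"

# Option 1: unicode escaping. The result is pure ascii, but unreadable.
escaped = name.encode("unicode_escape")

# Option 2: utf-8. Non-ascii characters become multi-byte sequences.
utf8 = name.encode("utf-8")

print(b"*** modified file " + escaped)
print(b"*** modified file " + utf8)

# Both forms round-trip back to the original name.
from_escaped = escaped.decode("unicode_escape")
from_utf8 = utf8.decode("utf-8")
```

Either form can be decoded on the receiving side; the difference is whether the changeset stays ascii-only or is readable in a utf-8 terminal.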
At one point we needed the above handling, because of parsing reasons.
But since I have switched to using ' // ' as the entry separator, I
think we'll be okay.
I'll commit a change today. It should be something like revision number
87 (I'm on 86 right now).
John
=:->
>
> robey
>