version-info --include-history UnicodeDecodeError (518609)

Thu Apr 8 23:31:27 BST 2010

On 08/04/2010, Robert Collins <robertc at robertcollins.net> wrote:
>
> Unicode is not UTF-8: unicode is the in memory representation of strings
> in python

I'm glad you're clear on this, because the documentation in bzrlib.rio
certainly isn't.

> This doesn't really have anything to do with readable text, does it ?
> bzr version-info is a tool to output *source code* to a file, for
> inclusion in a program. The RIO format for version-info uses RIO as the
> encoding of that file, and that means that the strings within it shall
> be UTF-8. You can read a RIO file, to get unicode data, and then encode
> it to CP932 whenever you want to show it on the console.

If the intention is that it should always output binary data to a
file, the command should take a file name as an argument rather than
print junk (in this problematic Works For Me way) on the console by
default.
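
The read-then-encode step described in the quoted paragraph can be
sketched roughly as follows. This is only an illustration: the stanza
line is made up rather than real version-info output, and the names
raw, text and console_bytes are placeholders, not anything in bzrlib.

```python
# Sketch only: this line is invented for illustration, it is not real
# `bzr version-info` output.
raw = u"committer: \u3042\u3044 <someone@example.com>\n".encode("utf-8")

# Bytes as stored in the RIO file -> unicode object in memory.
text = raw.decode("utf-8")

# Unicode object -> bytes in the console's encoding (CP932 here).
# "replace" substitutes "?" for anything CP932 cannot represent
# instead of raising UnicodeEncodeError.
console_bytes = text.encode("cp932", "replace")

# The two byte strings differ even though they carry the same text.
same_text = console_bytes.decode("cp932") == text
```

The point is that the UTF-8 bytes and the CP932 bytes are different
byte sequences for the same text, so writing the former straight to a
CP932 console is what shows up as junk.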

> Not 'bzr log' but rather ~/.bzr.log - the log file bzr writes to. There
> may have been confusion earlier in the thread. .bzr.log is primarily
> used in giving bug reports to us; it needs to be reliably decoded even
> if successive bzr invocations used different encodings, otherwise we
> can't use it as a sensible source of info.

Well, this wasn't a bug about .bzr.log (which 'knows' it's UTF-8 so
giving it unicode objects is fine), but about the console. If a
command is printing textual data to the console, I think it should be
readable.

> Not at all. I want to see bzr work well for everyone.

That's clearly the wish; the question is whether development practice
is thwarting it.

> That looks like bzrlib.rio.Stanza.to_lines is incorrectly returning
> 'unicode' rather than 'str' line objects, it should (per the docstring)
> be returning 'str' lines encoded in utf8. Looking at the code it appears
> to me that the 'tag' variable is the most likely culprit: a unicode tag
> would cause implicit upcasting of individual lines.

No, the problem is already diagnosed; see the bug and the earlier
message in the thread. It's clear from inspecting the traceback, or
bzrlib.version_info_formats.format_rio, or the implementation of
bzrlib.rio.check_tag (in bzrlib._rio_pyx?.pyx?), that the tags being
of type unicode is not the issue.
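
For anyone unfamiliar with the "implicit upcasting" in the quoted
text: Python 2 decodes a 'str' as ASCII whenever it is combined with a
'unicode' object. A minimal sketch of that general mechanism, with the
decode written out explicitly so it also runs on Python 3, is below.
The helper py2_style_concat and the sample lines are hypothetical, and
this is not a reproduction of the diagnosed cause of this bug.

```python
def py2_style_concat(byte_str, uni_str):
    """Mimic Python 2's `str + unicode` (hypothetical helper).

    Python 2 decodes the byte string with the default codec (ASCII)
    and then concatenates the two unicode objects.
    """
    return byte_str.decode("ascii") + uni_str


# Made-up sample lines, not real bzr output.
ascii_line = b"revno: 42\n"
utf8_line = u"committer: \u3042\n".encode("utf-8")

# Pure-ASCII bytes upcast silently; the result is a unicode object.
joined = py2_style_concat(ascii_line, u"tag")

# Non-ASCII UTF-8 bytes fail at the implicit decode step.
try:
    py2_style_concat(utf8_line, u"tag")
    failed = False
except UnicodeDecodeError:
    failed = True
```

So a stray unicode object is harmless while every line is ASCII and
only blows up once non-ASCII bytes appear.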

Martin