version-info --include-history UnicodeDecodeError (518609)

Thu Apr 8 22:52:48 BST 2010

On Thu, 2010-04-08 at 22:23 +0100, Martin (gzlist) wrote:
> On 06/04/2010, Robert Collins <robertc at robertcollins.net> wrote:
> >
> > RIO is a low level encoding: it should be outputting RIO (which builds
> > on utf8), not unicode.
> >
> > That is to say - if the console were to be in shift-JIS or something,
> > outputting unicode would *not* result in correct RIO output.
> 
> That's funny, the only documentation I found in the code talks about
> "Unicode" but doesn't mention UTF-8 at all. 

Unicode is not UTF-8: unicode is the in-memory representation of strings
in Python (sadly Python 2.x doesn't separate bytes and strings as clearly
as Python 3.x does, but the model isn't substantially changed between
them - just the separation).
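
To make the distinction concrete, here is a minimal sketch in Python 3
(in Python 2 the same roles are played by 'unicode' and 'str'). The sample
text is an arbitrary illustration and is not taken from bzr:

```python
# Text is a sequence of code points; an encoding turns it into bytes.
text = "\u65e5\u672c\u8a9e"  # three code points: 日本語

utf8_bytes = text.encode("utf-8")    # one possible byte representation
cp932_bytes = text.encode("cp932")   # a different byte representation

# The bytes differ, but both decode back to the same text.
assert utf8_bytes != cp932_bytes
assert utf8_bytes.decode("utf-8") == text
assert cp932_bytes.decode("cp932") == text
```

The same text has many byte forms; a file format has to pick one, and
RIO picks UTF-8.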

UTF-8 is mentioned:

bzrlib.rio.Stanza.to_lines:
        """Generate sequence of lines for external version of this file.

        The lines are always utf-8 encoded strings.
        """

> I don't interpret "The
> format itself does not deal with character encoding issues, though the
> result will normally be written in Unicode." as a ban on printing
> readable text on a CP932 console.

This doesn't really have anything to do with readable text, does it?
bzr version-info is a tool to output *source code* to a file, for
inclusion in a program. The RIO format for version-info uses RIO as the
encoding of that file, and that means that the strings within it shall
be UTF-8. You can read a RIO file, to get unicode data, and then encode
it to CP932 whenever you want to show it on the console.
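
A rough sketch of that read-then-display step, in Python 3. The helper
name 'for_console' and the sample line are hypothetical, and this does not
use bzrlib's own RIO reader; it only shows the decode/encode split, with
errors="replace" as one possible policy for characters the console cannot
represent:

```python
def for_console(utf8_line: bytes, console_encoding: str) -> bytes:
    """Decode a UTF-8 line from a file and re-encode it for a console.

    Hypothetical helper for illustration only.
    """
    text = utf8_line.decode("utf-8")  # file bytes -> unicode
    # Characters the console encoding lacks are replaced with '?'.
    return text.encode(console_encoding, errors="replace")


# What is stored in the file is always UTF-8.
stored = "committer: \u5c71\u7530".encode("utf-8")

# What is sent to a CP932 console is a different byte sequence.
shown = for_console(stored, "cp932")
```

The file contents stay the same whatever console is in use; only the
final display step depends on the user's encoding.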

> > bzr's log is defined as being utf8, so we shouldn't need a replace
> > statement there.
> 
> Is it? So, when I type `bzr log` on my console and get readable text
> rather than UTF-8 mangling, that's a bug?

Not 'bzr log' but rather ~/.bzr.log - the log file bzr writes to. There
may have been confusion earlier in the thread. .bzr.log is primarily
used in giving bug reports to us; it needs to be reliably decodable even
if successive bzr invocations used different encodings, otherwise we
can't use it as a sensible source of info.

'bzr log' should absolutely be localised for the user running bzr.

> Personally, I find the current state of affairs unacceptable. Because
> many bazaar developers use UTF-8 consoles and files with English-only
> text, it's "easy" to define various operations and 'internal'
> bytestrings as being UTF-8 without actually ensuring that's the case,
> or that it leads to sensible behaviour for anyone else. But y'all seem
> satisfied with a long stream of similar bug reports from Japanese
> users about these broken assumptions.

Not at all. I want to see bzr work well for everyone.

> This particular bug is a regression of sorts, the operation used to
> give mojibake, and now throws (attached,
> bzr_version_info_failure.log), note also log behaves as (un?)expected.

That looks like bzrlib.rio.Stanza.to_lines is incorrectly returning
'unicode' rather than 'str' line objects; it should (per the docstring)
be returning 'str' lines encoded in utf8. Looking at the code, it appears
to me that the 'tag' variable is the most likely culprit: a unicode tag
would cause implicit upcasting of individual lines.
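
For anyone unfamiliar with Python 2's coercion rules, here is a rough
sketch of that failure mode, written in Python 3 with the implicit step
made explicit. 'py2_concat' is a hypothetical stand-in for what Python 2
does on 'unicode + str'; it is not bzrlib code, and the tag and values are
made up:

```python
def py2_concat(left: str, right: bytes) -> str:
    """Emulate Python 2's `unicode + str`.

    Python 2 implicitly decodes the byte string with the default codec
    (ASCII) and returns a unicode result.
    """
    return left + right.decode("ascii")


tag = "committer"                                # a unicode tag
ascii_value = b": Rob\n"                         # pure-ASCII bytes
utf8_value = ": \u5c71\u7530\n".encode("utf-8")  # non-ASCII UTF-8 bytes

# With ASCII data the line is silently upcast to unicode.
line = py2_concat(tag, ascii_value)
assert isinstance(line, str)

# With non-ASCII UTF-8 data the implicit ASCII decode fails.
try:
    py2_concat(tag, utf8_value)
    raised = False
except UnicodeDecodeError:
    raised = True
```

This would explain why the problem only shows up once a non-ASCII value
is involved.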

-Rob