version-info --include-history UnicodeDecodeError (518609)

Fri Apr 9 00:00:31 BST 2010

On Thu, 2010-04-08 at 23:31 +0100, Martin (gzlist) wrote:
> > This doesn't really have anything to do with readable text, does it ?
> > bzr version-info is a tool to output *source code* to a file, for
> > inclusion in a program. The RIO format for version-info uses RIO as the
> > encoding of that file, and that means that the strings within it shall
> > be UTF-8. You can read a RIO file, to get unicode data, and then encode
> > it to CP932 whenever you want to show it on the console.
> 
> If the intention is that it should always output binary data to a
> file, the command should take a file name as an argument rather than
> print junk (in this problematic Works For Me way) on the console by
> default.

version info has a number of different formatters available; some may
make sense only as files, some could encode to the local encoding. I
don't think thats ideal; perhaps we could make the formatter 'know' that
it is writing to stdout and error unless stdout is a pipe (and thus
presumably being redirected). That seems fragile though.

> > Not 'bzr log' but rather ~/.bzr.log - the log file bzr writes to. There
> > may have been confusion earlier in the thread. .bzr.log is primarily
> > used in giving bug reports to us; it needs to be reliably decoded even
> > if successive bzr invocations used different encodings, otherwise we
> > can't use it as a sensible source of info.
> 
> Well, this wasn't a bug about .bzr.log (which 'knows' it's UTF-8 so
> giving it unicode objects is fine), but about the console. 

For some reason, some part of the thread looked like it was talking
about bzr.log stuff to me.

> If a
> command is printing textual data to the console, I think it should be
> readable.

I agree. Note though that stdout isn't necessarily the console - it can
be a file - which is how I think version-info expects to be used:
$ bzr version-info --rio > foo.rio

John, who wrote version-info as a plugin, can say more about this.

> > Not at all. I want to see bzr work well for everyone.
> 
> That's clearly the wish, the question is whether development practice
> is thwarting it.

I feel like you're upcasting a specific bug, in non-core code, as a much
larger issue than it is.

> > That looks like bzrlib.rio.Stanza.to_lines is incorrectly returning
> > 'unicode' rather than 'str' line objects, it should (per the docstring)
> > be returning 'str' lines encoded in utf8. Looking at the code it appears
> > to me that the 'tag' variable is the most likely culprit: a unicode tag
> > would cause implicit upcasting of individual lines.
> 
> No, the problem is already diagnosed, see the bug and earlier message
> in the thread. It's clear from inspection of the traceback, or
> bzrlib.version_info_formats.format_rio, or by looking at the
> implementation of bzrlib.rio.check_tag (in bzrlib._rio_pyx?.pyx?) that
> the tags being of type unicode is not the issue.

The bug isn't particularly helpful; especially as it isn't a display
command like log is: its outputting encoded data, not display data.

Using Stanza.to_unicode will get unicode strings, but if these are then
encoded into (say) cp932, RIO won't parse them. RIO only knows how to
parse utf8 streams, or unicode streams.

In my copy of trunk I cannot see a 'check_tag' method, and bzr-search
doesn't find it in history anywhere.

If the problem is that we're writing utf-8 str objects to a transcoding
file, we shouldn't do that for this command; its not appropriate. 

I think you are right, that using a filename in the command avoid the
question about what to do here.

-Rob
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20100409/79187fb0/attachment.pgp