version-info --include-history UnicodeDecodeError (518609)
Parth Malwankar
parth.malwankar at gmail.com
Fri Apr 9 06:31:47 BST 2010
On Fri, Apr 9, 2010 at 4:30 AM, Robert Collins
<robertc at robertcollins.net> wrote:
> On Thu, 2010-04-08 at 23:31 +0100, Martin (gzlist) wrote:
>> > This doesn't really have anything to do with readable text, does it ?
>> > bzr version-info is a tool to output *source code* to a file, for
>> > inclusion in a program. The RIO format for version-info uses RIO as the
>> > encoding of that file, and that means that the strings within it shall
>> > be UTF-8. You can read a RIO file, to get unicode data, and then encode
>> > it to CP932 whenever you want to show it on the console.
>>
>> If the intention is that it should always output binary data to a
>> file, the command should take a file name as an argument rather than
>> print junk (in this problematic Works For Me way) on the console by
>> default.
>
> version info has a number of different formatters available; some may
> make sense only as files, some could encode to the local encoding. I
> don't think thats ideal; perhaps we could make the formatter 'know' that
> it is writing to stdout and error unless stdout is a pipe (and thus
> presumably being redirected). That seems fragile though.
>
>> > Not 'bzr log' but rather ~/.bzr.log - the log file bzr writes to. There
>> > may have been confusion earlier in the thread. .bzr.log is primarily
>> > used in giving bug reports to us; it needs to be reliably decoded even
>> > if successive bzr invocations used different encodings, otherwise we
>> > can't use it as a sensible source of info.
>>
>> Well, this wasn't a bug about .bzr.log (which 'knows' it's UTF-8 so
>> giving it unicode objects is fine), but about the console.
>
> For some reason, some part of the thread looked like it was talking
> about bzr.log stuff to me.
>
>> If a
>> command is printing textual data to the console, I think it should be
>> readable.
>
> I agree. Note though that stdout isn't necessarily the console - it can
> be a file - which is how I think version-info expects to be used:
> $ bzr version-info --rio > foo.rio
>
> John, who wrote version-info as a plugin, can say more about this.
>
>> > Not at all. I want to see bzr work well for everyone.
>>
>> That's clearly the wish, the question is whether development practice
>> is thwarting it.
>
> I feel like you're upcasting a specific bug, in non-core code, as a much
> larger issue than it is.
>
>> > That looks like bzrlib.rio.Stanza.to_lines is incorrectly returning
>> > 'unicode' rather than 'str' line objects, it should (per the docstring)
>> > be returning 'str' lines encoded in utf8. Looking at the code it appears
>> > to me that the 'tag' variable is the most likely culprit: a unicode tag
>> > would cause implicit upcasting of individual lines.
>>
>> No, the problem is already diagnosed, see the bug and earlier message
>> in the thread. It's clear from inspection of the traceback, or
>> bzrlib.version_info_formats.format_rio, or by looking at the
>> implementation of bzrlib.rio.check_tag (in bzrlib._rio_pyx?.pyx?) that
>> the tags being of type unicode is not the issue.
>
> The bug isn't particularly helpful; especially as it isn't a display
> command like log is: its outputting encoded data, not display data.
>
> Using Stanza.to_unicode will get unicode strings, but if these are then
> encoded into (say) cp932, RIO won't parse them. RIO only knows how to
> parse utf8 streams, or unicode streams.
>
> In my copy of trunk I cannot see a 'check_tag' method, and bzr-search
> doesn't find it in history anywhere.
>
> If the problem is that we're writing utf-8 str objects to a transcoding
> file, we shouldn't do that for this command; its not appropriate.
>
> I think you are right, that using a filename in the command avoid the
> question about what to do here.
>
So as I understand it, in its current design RIO is a binary internal
format thats meant to be written to a file. I chatted with lifeless
briefly on irc and it seems we could go with specifying a filename
to avoid confusion with the version-info command.
====
<parthm> lifeless: yes. just noticed that. so rio is more of an internal
binary format (ascii + utf-8) not really meant for stdout. so what
would be a good fix for this?
<lifeless> parthm: as per the list/bug - change the command to take
a filename
permit '-' for people to that want to use it
<parthm> in 'version-info [location]' with location being optional we
probably need a option --output or something. so if the file is open-ed
with the correct encoding that should take care of making rio readable?
<lifeless> if we are breaking compat
change it to version-info filename [location]
or
change it to version-info filename [-d location]
(-d is a common argument we want to support for 'look elsewhere for stuff')
<parthm> lifeless: sounds fine to me. so this is probably more of a
2.3 change. i still need to understand rio better will probably spend some
more time on that.
<lifeless> ok
<parthm> i will discuss this on the mailing list in case there are some
more suggestions/refinements.
====
Regards,
Parth
More information about the bazaar
mailing list