[rfc] Help needed: encodings problems again
John Arbash Meinel
john at arbash-meinel.com
Mon Jun 12 18:12:04 BST 2006
Alexander Belchenko wrote:
> Despite that fact that encodings work is landed into bzr.dev there is
> too much places in bzrlib where sys.stdout used directly and therefore
> it create problems with unicode strings when output redirected via pipe
> and sys.stdout hide real encoding of console.
>
> I faced with problems in execution of command
>
> bzr missing | less
>
> when in revisions messages appears russian characters.
>
> When I grep through bzrlib for 'sys.stdout' I have list of about 70
> lines where sys.stdout used. And most of this lines is potentially
> caused problems when output of bzr will be redirected. And this is very
> sad.
>
> Fixing improper usage of sys.stdout in builtins command is very trivial,
> because we have attribute self.outf in all descendant of Command class.
> But I don't understand how to deal with other places that itself
> unrelated to Command class.
>
> Help needed.
>
> Attached patch solve problems with builtins command. Attached grep
> output showed lines where sys.stdout appears.
>
> --
> Alexander
>
Well, we actually have to watch out for both 'sys.stdout' and 'print'.
Since both of the print directly to the output, and not to a nicely
wrapped file.
The specific methods for how to handle it is that things that write to
the screen should be modified so that they are now:
def foo(..., to_file=None):
if to_file is None:
# DEPRECATED WARNING
to_file = sys.stdout
That lets the builtins upgrade them properly, without breaking API
compatibility at this point.
Also, the real goal is to write new tests in
bzrlib/tests/blackbox/test_non_ascii.py
This does lots of work to force the commands to run in encodings that
they are not happy about. And tests that the appropriate return code is
raised.
I got to maybe 50% of the commands, so there are certainly more that
should be fixed. We wanted to get it merged so that other people could
work on it as well.
The other reason to write them as test cases is because 'bzr missing'
should use "encoding_type='replace'" which says that it is okay if it
cannot represent the expected characters, while 'bzr ls' should fail if
it cannot properly represent the files. It just has to do with whether
the output is designed more for user edification, versus something you
would pipe into xargs, or other things. (cat-revision would be 'strict',
bzr diff and bzr bundle are actually 'exact' which says that there
should be no transparent encoding going on, because we will be putting
out raw bytestreams)
So I'm +0.5 on this change, since it fixes some more cases. But I'd be
fully +1 if you could write a few more tests in test_non_ascii.py
John
=:->
PS> Don't interpret to_file too early. If the specific function doesn't
actually print anything, but just passes it down to another function
which does, then you probably shouldn't replace it with sys.stdout.
>
> ------------------------------------------------------------------------
>
> === modified file 'bzrlib/builtins.py'
> --- bzrlib/builtins.py 2006-06-08 06:38:32 +0000
> +++ bzrlib/builtins.py 2006-06-12 16:32:42 +0000
> @@ -1133,7 +1133,7 @@
> raise BzrCommandError('bzr diff --revision takes exactly one or two revision identifiers')
> else:
> if tree2 is not None:
> - return show_diff_trees(tree1, tree2, sys.stdout,
> + return show_diff_trees(tree1, tree2, self.outf,
> specific_files=file_list,
> external_diff_options=diff_options,
> old_label=old_label, new_label=new_label)
> @@ -2354,7 +2354,7 @@
> if (log_format == None):
> default = bzrlib.config.BranchConfig(local_branch).log_format()
> log_format = get_log_format(long=long, short=short, line=line, default=default)
> - lf = log_formatter(log_format, sys.stdout,
> + lf = log_formatter(log_format, self.outf,
> show_ids=show_ids,
> show_timezone='original')
> if reverse is False:
> @@ -2431,9 +2431,9 @@
> rev_id = revision[0].in_history(b).rev_id
> t = Testament.from_revision(b.repository, rev_id)
> if long:
> - sys.stdout.writelines(t.as_text_lines())
> + self.outf.writelines(t.as_text_lines())
> else:
> - sys.stdout.write(t.as_short_text())
> + self.outf.write(t.as_short_text())
> finally:
> b.unlock()
>
> @@ -2473,7 +2473,7 @@
> file_id = tree.inventory.path2id(relpath)
> tree = branch.repository.revision_tree(revision_id)
> file_version = tree.inventory[file_id].revision
> - annotate_file(branch, file_version, file_id, long, all, sys.stdout)
> + annotate_file(branch, file_version, file_id, long, all, self.outf)
> finally:
> branch.unlock()
>
> @@ -2604,7 +2604,7 @@
>
> for r in range(revno, b.revno()+1):
> rev_id = b.get_rev_id(r)
> - lf = log_formatter('short', to_file=sys.stdout,show_timezone='original')
> + lf = log_formatter('short', to_file=self.outf, show_timezone='original')
> lf.show(r, b.repository.get_revision(rev_id), None)
>
> if dry_run:
>
I can go through the list that you are mentioning here, but I'm guessing
that most of these are actually valid.
If you added some context, it would be more obvious. Try 'grep -C2'.
All comments follow the lines I'm talking about.
> ------------------------------------------------------------------------
>
> bzrlib\add.py:61 self._to_file = sys.stdout
This is a compatibility statement.
> bzrlib\annotate.py:40 to_file = sys.stdout
as is this
> bzrlib\builtins.py:328 # sys.stdout encoding cannot represent it?
this is in a comment, which should probably be removed anyway.
> bzrlib\builtins.py 1136 return show_diff_trees(tree1, tree2, sys.stdout,
> bzrlib\builtins.py 2357 lf = log_formatter(log_format, sys.stdout,
> bzrlib\builtins.py 2434 sys.stdout.writelines(t.as_text_lines())
> bzrlib\builtins.py 2436 sys.stdout.write(t.as_short_text())
> bzrlib\builtins.py 2476 annotate_file(branch, file_version, file_id, long, all, sys.stdout)
> bzrlib\builtins.py 2607 lf = log_formatter('short', to_file=sys.stdout,show_timezone='original')
I'm pretty sure you fixed these in your patch.
> bzrlib\commands.py:201 exact - do not encode sys.stdout
> bzrlib\commands.py 233 # *way* too much like sys.stdout
> bzrlib\commands.py 235 self.outf = sys.stdout
> bzrlib\commands.py 238 output_encoding = getattr(sys.stdout, 'encoding', None)
> bzrlib\commands.py 248 mutter('encoding stdout as sys.stdout encoding %r', output_encoding)
> bzrlib\commands.py 252 self.outf = codecs.getwriter(output_encoding)(sys.stdout, errors=self.encoding_type)
> bzrlib\commands.py 669 sys.stdout.flush()
> bzrlib\commands.py 701 sys.stdout.flush()
ALl the above seem to be valid, and need to use sys.stdout, or are in
comments/docstrings.
> bzrlib\diff.py:86 if to_file != sys.stdout:
This is something someone else worked on, but is valid for now.
> bzrlib\diff.py 178 output = sys.stdout
a compatibility workaround.
> bzrlib\diff.py 225 output = sys.stdout
This one seems genuinely bogus
> bzrlib\diff.py 239 return show_diff_trees(old_tree, new_tree, sys.stdout, specific_files,
this one is bogus too.
> bzrlib\help.py:57 outfile = sys.stdout
> bzrlib\help.py 94 outfile = sys.stdout
> bzrlib\help.py 122 outfile = sys.stdout
> bzrlib\help.py 146 outfile = sys.stdout
these all seem to be compatibility checks.
> bzrlib\log.py:510 to_file = codecs.getwriter(bzrlib.user_encoding)(sys.stdout, errors='replace')
this is probably bogus, since the user_encoding should be happening much
higher up. (And it should be using sys.stdout.encoding, not
user_encoding, etc).
> bzrlib\lsprof.py:42 file = sys.stdout
This is probably okay. Since we have to write out the results somewhere,
and this occurs before we are running any command.
> bzrlib\merge3.py:426 # sys.stdout.writelines(m3.merge_lines(name_a=argv[1], name_b=argv[3]))
> bzrlib\merge3.py 427 sys.stdout.writelines(m3.merge_annotated())
bogus. merge3 should take a to_file.
> bzrlib\patiencediff.py:370 sys.stdout.write(line)
This is fine, because it is in a custom 'main()' function.
> bzrlib\progress.py:87 to_messages_file = sys.stdout
> bzrlib\progress.py 149 to_messages_file = sys.stdout
compatibility lines
> bzrlib\shellcomplete.py:6 outfile = sys.stdout
> bzrlib\shellcomplete.py 16 outfile = sys.stdout
> bzrlib\shellcomplete.py 36 outfile = sys.stdout
> bzrlib\shellcomplete.py 54 outfile = sys.stdout
same here.
> bzrlib\status.py:113 to_file = sys.stdout
> bzrlib\textui.py:35 to_file = sys.stdout
and both of these
> bzrlib\tree.py:106 sys.stdout.write(self.get_file_text(file_id))
This is bogus.
> bzrlib\weave.py:1360 sys.stdout.writelines(w.get_iter(int(argv[3])))
> bzrlib\weave.py 1371 sys.stdout.writelines(diff_gen)
> bzrlib\weave.py 1417 sys.stdout.writelines(w.weave_merge(p))
These are also a custom 'main()' function.
> bzrlib\weave_commands.py:58 write_weave(w1, sys.stdout)
> bzrlib\weave_commands.py 90 sys.stdout.writelines(w.weave_merge(p))
weave_commands is an extension of builtins, and should be using 'self.outf'.
> bzrlib\bundle\commands.py:141 fileobj = sys.stdout
> bzrlib\bundle\commands.py 166 #serializer_v4.write(cset_tree.inventory, sys.stdout)
Also an extension of builtins.
> bzrlib\tests\__init__.py:378 Internally we can check sys.stdout to see what the output encoding
> bzrlib\tests\__init__.py 658 :param encoding: encoding for sys.stdout and sys.stderr
> bzrlib\tests\__init__.py 788 real_stdout = sys.stdout
> bzrlib\tests\__init__.py 791 sys.stdout = stdout
> bzrlib\tests\__init__.py 796 sys.stdout = real_stdout
> bzrlib\tests\__init__.py 1145 runner = TextTestRunner(stream=sys.stdout,
probably these are okay, since it is running the test suite.
> bzrlib\tests\test_gpg.py:56 # It is too much work to make sys.stdout be in binary mode.
> bzrlib\tests\test_gpg.py 59 'import sys; sys.stdout.write(sys.stdin.read())']
> bzrlib\tests\test_patches.py:153 sys.stdout.write(" orig:%spatch:%s" % (next,
> bzrlib\tests\test_ui.py:52 # TODO: Test the output from TextUIFactory, perhaps by overriding sys.stdout
> bzrlib\tests\blackbox\__init__.py:116 self.stdout = sys.stdout
> bzrlib\tests\blackbox\test_log.py:250 # encoded to sys.stdout.encoding
> bzrlib\tests\blackbox\test_selftest.py:139 self.assertNotEqual(sys.stdout, self.factory.stdout)
> bzrlib\tests\blackbox\test_status.py:226 self.stdout = sys.stdout
> bzrlib\tests\blackbox\test_status.py 230 sys.stdout = self.stdout
> bzrlib\tests\blackbox\test_status.py 246 sys.stdout = StringIO()
> bzrlib\tests\blackbox\test_status.py 257 sys.stdout = StringIO()
> bzrlib\ui\text.py:46 self.stdout = sys.stdout
> bzrlib\ui\text.py 75 prompt = (prompt % kwargs).encode(sys.stdout.encoding, 'replace')
> bzrlib\util\elementtree\ElementTree.py:729 # Writes an element tree or element structure to sys.stdout. This
> bzrlib\util\elementtree\ElementTree.py 741 elem.write(sys.stdout)
> bzrlib\util\elementtree\ElementTree.py 744 sys.stdout.write("\n")
> bzrlib\util\urlgrabber\keepalive.py:498 sys.stdout.write(' first using the normal urllib handlers')
> bzrlib\util\urlgrabber\keepalive.py 505 sys.stdout.write(' now using the keepalive handler ')
> bzrlib\util\urlgrabber\keepalive.py 545 sys.stdout.write('\r %2i' % i)
> bzrlib\util\urlgrabber\keepalive.py 546 sys.stdout.flush()
I wouldn't change any of the util code. And probably the tests are fine.
So probably 75% of these are valid compatibility uses, and hopefully
I've made it clear which ones need to be fixed.
John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060612/3f4ede17/attachment.pgp
More information about the bazaar
mailing list