[rfc] Help needed: encodings problems again

Mon Jun 12 18:12:04 BST 2006

Alexander Belchenko wrote:
> Despite that fact that encodings work is landed into bzr.dev there is
> too much places in bzrlib where sys.stdout used directly and therefore
> it create problems with unicode strings when output redirected via pipe
> and sys.stdout hide real encoding of console.
> 
> I faced with problems in execution of command
> 
> bzr missing | less
> 
> when in revisions messages appears russian characters.
> 
> When I grep through bzrlib for 'sys.stdout' I have list of about 70
> lines where sys.stdout used. And most of this lines is potentially
> caused problems when output of bzr will be redirected. And this is very
> sad.
> 
> Fixing improper usage of sys.stdout in builtins command is very trivial,
> because we have attribute self.outf in all descendant of Command class.
> But I don't understand how to deal with other places that itself
> unrelated to Command class.
> 
> Help needed.
> 
> Attached patch solve problems with builtins command. Attached grep
> output showed lines where sys.stdout appears.
> 
> -- 
> Alexander
> 

Well, we actually have to watch out for both 'sys.stdout' and 'print'.
Since both of the print directly to the output, and not to a nicely
wrapped file.

The specific methods for how to handle it is that things that write to
the screen should be modified so that they are now:

def foo(..., to_file=None):
  if to_file is None:
    # DEPRECATED WARNING
    to_file = sys.stdout

That lets the builtins upgrade them properly, without breaking API
compatibility at this point.

Also, the real goal is to write new tests in
bzrlib/tests/blackbox/test_non_ascii.py

This does lots of work to force the commands to run in encodings that
they are not happy about. And tests that the appropriate return code is
raised.

I got to maybe 50% of the commands, so there are certainly more that
should be fixed. We wanted to get it merged so that other people could
work on it as well.

The other reason to write them as test cases is because 'bzr missing'
should use "encoding_type='replace'" which says that it is okay if it
cannot represent the expected characters, while 'bzr ls' should fail if
it cannot properly represent the files. It just has to do with whether
the output is designed more for user edification, versus something you
would pipe into xargs, or other things. (cat-revision would be 'strict',
bzr diff and bzr bundle are actually 'exact' which says that there
should be no transparent encoding going on, because we will be putting
out raw bytestreams)

So I'm +0.5 on this change, since it fixes some more cases. But I'd be
fully +1 if you could write a few more tests in test_non_ascii.py

John
=:->

PS> Don't interpret to_file too early. If the specific function doesn't
actually print anything, but just passes it down to another function
which does, then you probably shouldn't replace it with sys.stdout.

> 
> ------------------------------------------------------------------------
> 
> === modified file 'bzrlib/builtins.py'
> --- bzrlib/builtins.py	2006-06-08 06:38:32 +0000
> +++ bzrlib/builtins.py	2006-06-12 16:32:42 +0000
> @@ -1133,7 +1133,7 @@
>                  raise BzrCommandError('bzr diff --revision takes exactly one or two revision identifiers')
>          else:
>              if tree2 is not None:
> -                return show_diff_trees(tree1, tree2, sys.stdout, 
> +                return show_diff_trees(tree1, tree2, self.outf,
>                                         specific_files=file_list,
>                                         external_diff_options=diff_options,
>                                         old_label=old_label, new_label=new_label)
> @@ -2354,7 +2354,7 @@
>                  if (log_format == None):
>                      default = bzrlib.config.BranchConfig(local_branch).log_format()
>                      log_format = get_log_format(long=long, short=short, line=line, default=default)
> -                lf = log_formatter(log_format, sys.stdout,
> +                lf = log_formatter(log_format, self.outf,
>                                     show_ids=show_ids,
>                                     show_timezone='original')
>                  if reverse is False:
> @@ -2431,9 +2431,9 @@
>                  rev_id = revision[0].in_history(b).rev_id
>              t = Testament.from_revision(b.repository, rev_id)
>              if long:
> -                sys.stdout.writelines(t.as_text_lines())
> +                self.outf.writelines(t.as_text_lines())
>              else:
> -                sys.stdout.write(t.as_short_text())
> +                self.outf.write(t.as_short_text())
>          finally:
>              b.unlock()
>  
> @@ -2473,7 +2473,7 @@
>              file_id = tree.inventory.path2id(relpath)
>              tree = branch.repository.revision_tree(revision_id)
>              file_version = tree.inventory[file_id].revision
> -            annotate_file(branch, file_version, file_id, long, all, sys.stdout)
> +            annotate_file(branch, file_version, file_id, long, all, self.outf)
>          finally:
>              branch.unlock()
>  
> @@ -2604,7 +2604,7 @@
>  
>          for r in range(revno, b.revno()+1):
>              rev_id = b.get_rev_id(r)
> -            lf = log_formatter('short', to_file=sys.stdout,show_timezone='original')
> +            lf = log_formatter('short', to_file=self.outf, show_timezone='original')
>              lf.show(r, b.repository.get_revision(rev_id), None)
>  
>          if dry_run:
> 

I can go through the list that you are mentioning here, but I'm guessing
that most of these are actually valid.
If you added some context, it would be more obvious. Try 'grep -C2'.

All comments follow the lines I'm talking about.

> ------------------------------------------------------------------------
> 
> bzrlib\add.py:61                  self._to_file = sys.stdout

This is a compatibility statement.

> bzrlib\annotate.py:40              to_file = sys.stdout

as is this

> bzrlib\builtins.py:328             #       sys.stdout encoding cannot represent it?

this is in a comment, which should probably be removed anyway.

> bzrlib\builtins.py  1136                    return show_diff_trees(tree1, tree2, sys.stdout, 
> bzrlib\builtins.py  2357                    lf = log_formatter(log_format, sys.stdout,
> bzrlib\builtins.py  2434                    sys.stdout.writelines(t.as_text_lines())
> bzrlib\builtins.py  2436                    sys.stdout.write(t.as_short_text())
> bzrlib\builtins.py  2476                annotate_file(branch, file_version, file_id, long, all, sys.stdout)
> bzrlib\builtins.py  2607                lf = log_formatter('short', to_file=sys.stdout,show_timezone='original')

I'm pretty sure you fixed these in your patch.

> bzrlib\commands.py:201                 exact - do not encode sys.stdout
> bzrlib\commands.py  233             # *way* too much like sys.stdout
> bzrlib\commands.py  235                 self.outf = sys.stdout
> bzrlib\commands.py  238             output_encoding = getattr(sys.stdout, 'encoding', None)
> bzrlib\commands.py  248                 mutter('encoding stdout as sys.stdout encoding %r', output_encoding)
> bzrlib\commands.py  252             self.outf = codecs.getwriter(output_encoding)(sys.stdout, errors=self.encoding_type)
> bzrlib\commands.py  669                 sys.stdout.flush()
> bzrlib\commands.py  701                 sys.stdout.flush()

ALl the above seem to be valid, and need to use sys.stdout, or are in
comments/docstrings.

> bzrlib\diff.py:86          if to_file != sys.stdout:

This is something someone else worked on, but is valid for now.

> bzrlib\diff.py  178             output = sys.stdout

a compatibility workaround.

> bzrlib\diff.py  225         output = sys.stdout

This one seems genuinely bogus

> bzrlib\diff.py  239         return show_diff_trees(old_tree, new_tree, sys.stdout, specific_files,

this one is bogus too.

> bzrlib\help.py:57              outfile = sys.stdout
> bzrlib\help.py  94              outfile = sys.stdout
> bzrlib\help.py  122             outfile = sys.stdout
> bzrlib\help.py  146             outfile = sys.stdout

these all seem to be compatibility checks.

> bzrlib\log.py:510             to_file = codecs.getwriter(bzrlib.user_encoding)(sys.stdout, errors='replace')

this is probably bogus, since the user_encoding should be happening much
higher up. (And it should be using sys.stdout.encoding, not
user_encoding, etc).

> bzrlib\lsprof.py:42                  file = sys.stdout

This is probably okay. Since we have to write out the results somewhere,
and this occurs before we are running any command.

> bzrlib\merge3.py:426         # sys.stdout.writelines(m3.merge_lines(name_a=argv[1], name_b=argv[3]))
> bzrlib\merge3.py  427         sys.stdout.writelines(m3.merge_annotated())

bogus. merge3 should take a to_file.

> bzrlib\patiencediff.py:370             sys.stdout.write(line)

This is fine, because it is in a custom 'main()' function.

> bzrlib\progress.py:87                  to_messages_file = sys.stdout
> bzrlib\progress.py  149                 to_messages_file = sys.stdout

compatibility lines

> bzrlib\shellcomplete.py:6               outfile = sys.stdout
> bzrlib\shellcomplete.py  16              outfile = sys.stdout
> bzrlib\shellcomplete.py  36              outfile = sys.stdout
> bzrlib\shellcomplete.py  54              outfile = sys.stdout

same here.

> bzrlib\status.py:113             to_file = sys.stdout
> bzrlib\textui.py:35              to_file = sys.stdout

and both of these

> bzrlib\tree.py:106             sys.stdout.write(self.get_file_text(file_id))

This is bogus.

> bzrlib\weave.py:1360            sys.stdout.writelines(w.get_iter(int(argv[3])))
> bzrlib\weave.py  1371            sys.stdout.writelines(diff_gen)
> bzrlib\weave.py  1417            sys.stdout.writelines(w.weave_merge(p))

These are also a custom 'main()' function.

> bzrlib\weave_commands.py:58              write_weave(w1, sys.stdout)
> bzrlib\weave_commands.py  90              sys.stdout.writelines(w.weave_merge(p))

weave_commands is an extension of builtins, and should be using 'self.outf'.

> bzrlib\bundle\commands.py:141                 fileobj = sys.stdout
> bzrlib\bundle\commands.py  166             #serializer_v4.write(cset_tree.inventory, sys.stdout)

Also an extension of builtins.

> bzrlib\tests\__init__.py:378         Internally we can check sys.stdout to see what the output encoding
> bzrlib\tests\__init__.py  658             :param encoding: encoding for sys.stdout and sys.stderr
> bzrlib\tests\__init__.py  788             real_stdout = sys.stdout
> bzrlib\tests\__init__.py  791                 sys.stdout = stdout
> bzrlib\tests\__init__.py  796                 sys.stdout = real_stdout
> bzrlib\tests\__init__.py  1145        runner = TextTestRunner(stream=sys.stdout,

probably these are okay, since it is running the test suite.

> bzrlib\tests\test_gpg.py:56                  # It is too much work to make sys.stdout be in binary mode.
> bzrlib\tests\test_gpg.py  59                          'import sys; sys.stdout.write(sys.stdin.read())']
> bzrlib\tests\test_patches.py:153                             sys.stdout.write(" orig:%spatch:%s" % (next,
> bzrlib\tests\test_ui.py:52              # TODO: Test the output from TextUIFactory, perhaps by overriding sys.stdout
> bzrlib\tests\blackbox\__init__.py:116                 self.stdout = sys.stdout
> bzrlib\tests\blackbox\test_log.py:250             # encoded to sys.stdout.encoding
> bzrlib\tests\blackbox\test_selftest.py:139             self.assertNotEqual(sys.stdout, self.factory.stdout)
> bzrlib\tests\blackbox\test_status.py:226             self.stdout = sys.stdout
> bzrlib\tests\blackbox\test_status.py  230             sys.stdout = self.stdout
> bzrlib\tests\blackbox\test_status.py  246             sys.stdout = StringIO()
> bzrlib\tests\blackbox\test_status.py  257             sys.stdout = StringIO()
> bzrlib\ui\text.py:46                  self.stdout = sys.stdout
> bzrlib\ui\text.py  75              prompt = (prompt % kwargs).encode(sys.stdout.encoding, 'replace')
> bzrlib\util\elementtree\ElementTree.py:729     # Writes an element tree or element structure to sys.stdout.  This
> bzrlib\util\elementtree\ElementTree.py  741         elem.write(sys.stdout)
> bzrlib\util\elementtree\ElementTree.py  744             sys.stdout.write("\n")
> bzrlib\util\urlgrabber\keepalive.py:498         sys.stdout.write('  first using the normal urllib handlers')
> bzrlib\util\urlgrabber\keepalive.py  505         sys.stdout.write('  now using the keepalive handler       ')
> bzrlib\util\urlgrabber\keepalive.py  545             sys.stdout.write('\r  %2i' % i)
> bzrlib\util\urlgrabber\keepalive.py  546             sys.stdout.flush()

I wouldn't change any of the util code. And probably the tests are fine.
So probably 75% of these are valid compatibility uses, and hopefully
I've made it clear which ones need to be fixed.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060612/3f4ede17/attachment.pgp