[MERGE] Readable and properly encoded diff headers

Tue Aug 15 17:06:23 BST 2006

Adeodato Simó wrote:
> * Aaron Bentley [Tue, 15 Aug 2006 10:31:40 -0400]:
> 
>> Adeodato Simó wrote:
>>> The attached patch changes %r to '%s' in the headers, and also encodes
>>> the filenames in the proper encoding, not hardcoded 'utf8'.
> 
>> I don't think we should assume that the destination is a terminal. 
> 
> Well, I was initially using bzrlib.user_encoding, and John recommended
> terminal_encoding() instead. In any case, this is not about the output
> being a terminal or not, but about having the headers of diff's output
> in the proper encoding.
> 
>> For example, I believe this functionality is invoked by bzr-gtk's
>> gdiff, and so changing it to something other than utf-8 might be a
>> regression.
> 
> Or the opposite. Do bzr-gtk or gdiff work with non-ascii filenames in
> non-utf8 locales? I'd assume it doesn't, same as current bzr doesn't.

Actually, they do. Because gtk would use Unicode (or utf8) to display on
the screen. Not the terminal encoding. They use a different rendering
engine than the one the terminal uses.

I know for a fact that I've written Unicode aware wxPython code, which
can display Arabic characters, while my terminal is iso-8859-1 (actually
code page 1251 or something weird like that, but close).

> 
> And as for being a regression, current show_diff_trees() always returns
> headers in *ASCII*, never in 8bit (because of %r). So between shoving
> 8bit down applications always in UTF-8, and shoving it in the user's
> encoding, I think the latter is preferable, unless somebody can explain
> me this would break stuff.
> 
> * John Arbash Meinel [Tue, 15 Aug 2006 09:37:41 -0500]:
> 
>> I think what we need is for 'show_diff_trees' to take the path_encoding
>> parameter, and pass it down the line. Rather than have it detected in
>> _show_diff_trees.
> 
>> And then 'cmd_diff' can set path_encoding = osutils.terminal_encoding().
> 
> I don't fancy this much, having the encoding travelling all over the
> place.

Well, you sort of have to. One place knows what the correct encoding is.
And another place needs to know what it is.
If think the default should be path_encoding='utf8' everywhere, and this
just gets overridden by the cmd_diff code.

The code in bzrlib (outside of builtins.py and a few other areas), are
not meant to be aware of what the outside world is. So that you can
layer on a GUI, etc. And a GUI is likely to call 'show_diff_trees()'
certainly they are not going to call '_show_diff_trees()'. They might
re-invent the logic in _show_diff_trees() because we don't export a
function which lets them do what they want.

> 
>> 'self.outf' for cmd_diff is setup *without* encoding, because we don't
>> want to silently transcode the contents of the file. What I would really
>> like is to have a 'do not accept unicode' file-like object when
>> encoding_type='exact'.
> 
>> The to_file should definitely be self.outf. And it would be possible for
>> self.outf to have an encoding wrapper, and to write out the paths in
>> unicode. But I feel it leaves us open to a bug when writing something
>> else as part of 'diff'.
> 
> Wouldn't it be possible to have a file-like object that does: "if I
> receive unicode, I encode it to $user_encoding; if I get strings, I pass
> them unmodified"? Would it make sense?
> 
> Cheers,
> 

Well, we already have that with self.outf and encoding_type='strict'.
(or possibly encoding_type='replace').

That is another thing that you aren't testing. Graceful degradation. If
you have LANG=C, and you do a diff on the file 'جوجو.txt' what should be
output?
Hence the desire for the 'bzr diff --strict' flag.

I really do believe you want to put the parameter on show_diff_trees().
It isn't up to bzrlib/diff.py to determine what the output encoding is.

And we can't just use 'user_encoding' because that is not always
terminal encoding. I'm sorry this is so complicated, but that is the way
encoding is. Yell at Microsoft for having a different
locale.getpreferredencoding() and terminal encoding. I believe the first
is the preferred encoding for file contents, not what will be printed
out on the screen. So we still need 'preferredencoding' for some things,
like make_commit_message(), since that should be in preferredencoding.

And probably if you redirect the output to a file, that should also be
in user_encoding, rather than terminal encoding. But Alexander is our #1
windows user, and for him it was better to still use terminal encoding.

Maybe just because he wanted to be able to do:

bzr diff > file; cat file

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060815/9c6a7348/attachment.pgp