[MERGE] Show the diff in the commit messages

Mon Jul 16 19:07:51 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 7/14/07, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Goffredo Baroncelli wrote:
>> >> Causes the diff to be interpreted as utf-8, instead of leaving it as
>> >> binary.
>> >
>> > Yes, because the paths in the diff header are utf8 encoded!
>>
>> Perhaps we should make the diff header encoding selectable.
> 
> This actually came up just recently when Jonathan posted a patch to
> fix the display of filenames in the diff.  If I understand correctly,
> what we really want is:
> 
> * The headers should be displayed in the user's appropriate locale
> encoding.
> 
> * The diff should be written as a byte stream with no interpretation,
> because we don't know what encoding the stored bytes are in.
> 
> This is a bit odd because the output stream may have different
> sections in different encodings.  However, the common cases are
> probably that either the contents or the filenames or both are in
> plain ascii, or the two encodings are the same, and so it's not
> actually a problem.

Actually, in standard Windows, they are different encodings. Very
unfortunately, but still true.

I know Alexander ran into this. But I believe the standard disk encoding is
cp1251, but the terminal encoding is cp437. I won't guarantee those values, but
the *important* thing is that the standard terminal encoding is different from
the standard disk encoding.

Originally, we went with utf-8 diff paths because we wanted to have a 'standard
diff' that could be used by patch. Bundles/merge directives mean we don't need
this, so I'm happy to have us switch to the 'to_file.encoding' value.

> 
> However, when we discussed jml's patch, it looked like it was hard to
> get quite the right encoding for the diff headers for implementation
> reasons.  (Thinking about it now I'm not quite sure I buy that -
> couldn't we just look at the user encoding from within the diff code?)
> 

The 'to_file' may not be a terminal. I could be calling it from Olive, and
wanting a UTF-8 encoded result, but I'm on Windows with a user encoding of
cp1251. cStringIO.StringIO() doesn't have an '.encoding' attribute, so I don't
have a way of passing that information in, without a passing a 'path_encoding'
through a whole bunch of layers.

> I would really like to get this feature though.
> 
> I know this change is already rejected but just for a future submission:
> @@ -204,7 +204,23 @@
>     # confirm/write a message.
>     from StringIO import StringIO       # must be unicode-safe
>     from bzrlib.status import show_tree_status
> -    status_tmp = StringIO()
> +    class UnicodeStringIO(StringIO):
> +        def __init__(self, buf='', decoding='utf8'):
> +                StringIO.__init__(self, buf)
> +                self._usio_decoding = decoding
> +
> +        def write(self, s):
> +            if not isinstance(s, unicode):
> +                s = s.decode(self._usio_decoding, "replace")
> +            StringIO.write(self, s)
> +    status_tmp = UnicodeStringIO()
>     show_tree_status(working_tree, specific_files=specific_files,
>                      to_file=status_tmp)
> +    if diff:
> +        status_tmp.write(u"\n")
> +        from bzrlib.diff import show_diff_trees
> +        show_diff_trees(working_tree.basis_tree(), working_tree,
> +                    status_tmp,
> +                    specific_files)
> +
> 
> Even if we were adding this, I'd like something like UnicodeStringIO
> to be declared separately and to have some tests.
> 
> Thanks for including a manual update.
> 
> I can see two reasonable ways to update this:
> 
> 1- Redefine the commit message template as a byte string (make a new
> method with that meaning), so that we can allow it to include
> uninterpreted binary data from the diff.
> 
> 2- Interpret the diff as being in the user's encoding and read it into
> the commit message template with errors=replace.  That has the benefit
> that they should actually be able to read all of it without complaints
> from their editor, and since the diff is just for display it doesn't
> matter so much if some data is lost.  The main problem here is that
> people with non-utf8 locales and non-ascii filenames will get them
> mangled until we fix diff to use the user's encoding.  But we should
> do that anyhow.
> 
> So I think I like #2 best.

See above why #2 doesn't actually work like you want.

> 
> In fact, just reading it as ascii, errors=replace would be pretty useful.
> 

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGm7OiJdeBCYSNAAMRAgmlAKCEOjEPtLOeH7Y7QbTJv5AhhQuKAwCgrkD4
CDaxZ+7lRtnSIPJjEa3jqjU=
=oZCj
-----END PGP SIGNATURE-----