[MERGE] Show the diff in the commit messages
John Arbash Meinel
john at arbash-meinel.com
Mon Jul 16 19:07:51 BST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Martin Pool wrote:
> On 7/14/07, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Goffredo Baroncelli wrote:
>> >> Causes the diff to be interpreted as utf-8, instead of leaving it as
>> >> binary.
>> >
>> > Yes, because the paths in the diff header are utf8 encoded!
>>
>> Perhaps we should make the diff header encoding selectable.
>
> This actually came up just recently when Jonathan posted a patch to
> fix the display of filenames in the diff. If I understand correctly,
> what we really want is:
>
> * The headers should be displayed in the user's appropriate locale
> encoding.
>
> * The diff should be written as a byte stream with no interpretation,
> because we don't know what encoding the stored bytes are in.
>
> This is a bit odd because the output stream may have different
> sections in different encodings. However, the common cases are
> probably that either the contents or the filenames or both are in
> plain ascii, or the two encodings are the same, and so it's not
> actually a problem.
Actually, in standard Windows, they are different encodings. Very
unfortunately, but still true.
I know Alexander ran into this. But I believe the standard disk encoding is
cp1251, but the terminal encoding is cp437. I won't guarantee those values, but
the *important* thing is that the standard terminal encoding is different from
the standard disk encoding.
Originally, we went with utf-8 diff paths because we wanted to have a 'standard
diff' that could be used by patch. Bundles/merge directives mean we don't need
this, so I'm happy to have us switch to the 'to_file.encoding' value.
>
> However, when we discussed jml's patch, it looked like it was hard to
> get quite the right encoding for the diff headers for implementation
> reasons. (Thinking about it now I'm not quite sure I buy that -
> couldn't we just look at the user encoding from within the diff code?)
>
The 'to_file' may not be a terminal. I could be calling it from Olive, and
wanting a UTF-8 encoded result, but I'm on Windows with a user encoding of
cp1251. cStringIO.StringIO() doesn't have an '.encoding' attribute, so I don't
have a way of passing that information in, without a passing a 'path_encoding'
through a whole bunch of layers.
> I would really like to get this feature though.
>
> I know this change is already rejected but just for a future submission:
> @@ -204,7 +204,23 @@
> # confirm/write a message.
> from StringIO import StringIO # must be unicode-safe
> from bzrlib.status import show_tree_status
> - status_tmp = StringIO()
> + class UnicodeStringIO(StringIO):
> + def __init__(self, buf='', decoding='utf8'):
> + StringIO.__init__(self, buf)
> + self._usio_decoding = decoding
> +
> + def write(self, s):
> + if not isinstance(s, unicode):
> + s = s.decode(self._usio_decoding, "replace")
> + StringIO.write(self, s)
> + status_tmp = UnicodeStringIO()
> show_tree_status(working_tree, specific_files=specific_files,
> to_file=status_tmp)
> + if diff:
> + status_tmp.write(u"\n")
> + from bzrlib.diff import show_diff_trees
> + show_diff_trees(working_tree.basis_tree(), working_tree,
> + status_tmp,
> + specific_files)
> +
>
> Even if we were adding this, I'd like something like UnicodeStringIO
> to be declared separately and to have some tests.
>
> Thanks for including a manual update.
>
> I can see two reasonable ways to update this:
>
> 1- Redefine the commit message template as a byte string (make a new
> method with that meaning), so that we can allow it to include
> uninterpreted binary data from the diff.
>
> 2- Interpret the diff as being in the user's encoding and read it into
> the commit message template with errors=replace. That has the benefit
> that they should actually be able to read all of it without complaints
> from their editor, and since the diff is just for display it doesn't
> matter so much if some data is lost. The main problem here is that
> people with non-utf8 locales and non-ascii filenames will get them
> mangled until we fix diff to use the user's encoding. But we should
> do that anyhow.
>
> So I think I like #2 best.
See above why #2 doesn't actually work like you want.
>
> In fact, just reading it as ascii, errors=replace would be pretty useful.
>
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGm7OiJdeBCYSNAAMRAgmlAKCEOjEPtLOeH7Y7QbTJv5AhhQuKAwCgrkD4
CDaxZ+7lRtnSIPJjEa3jqjU=
=oZCj
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list