[MERGE] Show the diff in the commit messages

Mon Jul 16 07:03:40 BST 2007

On 7/14/07, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> Goffredo Baroncelli wrote:
> >> Causes the diff to be interpreted as utf-8, instead of leaving it as
> >> binary.
> >
> > Yes, because the paths in the diff header are utf8 encoded!
>
> Perhaps we should make the diff header encoding selectable.

This actually came up just recently when Jonathan posted a patch to
fix the display of filenames in the diff.  If I understand correctly,
what we really want is:

 * The headers should be displayed in the user's appropriate locale encoding.

 * The diff body should be written as a byte stream with no interpretation,
   because we don't know what encoding the stored bytes are in.

This is a bit odd because the output stream may have different
sections in different encodings.  However, the common cases are
probably that either the contents or the filenames or both are in
plain ascii, or the two encodings are the same, and so it's not
actually a problem.
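
To make the split concrete, here is a rough sketch (Python 3, not bzrlib
code) of a writer that encodes the header lines in the user's encoding and
passes the body bytes through untouched.  write_diff and its parameters are
made-up names for illustration, and locale.getpreferredencoding() is only a
stand-in for however we would really look up the user encoding.

```python
import io
import locale


def write_diff(out, old_path, new_path, body_lines, header_encoding=None):
    """Write a diff to a binary stream (hypothetical helper).

    Paths are text and are encoded with header_encoding; body lines are
    bytes and are written through without any interpretation.
    """
    if header_encoding is None:
        # Assumption: the locale's preferred encoding is the "user encoding".
        header_encoding = locale.getpreferredencoding(False) or "ascii"
    for prefix, path in (("---", old_path), ("+++", new_path)):
        line = "%s %s\n" % (prefix, path)
        # "replace" so an unencodable filename degrades instead of raising.
        out.write(line.encode(header_encoding, "replace"))
    for raw in body_lines:
        out.write(raw)  # stored bytes, encoding unknown


buf = io.BytesIO()
write_diff(buf, "caf\u00e9.txt", "caf\u00e9.txt",
           [b"-old \xff bytes\n", b"+new \xfe bytes\n"],
           header_encoding="utf-8")
result = buf.getvalue()
```

The output stream has to be binary for this to work, which is exactly the
"different sections in different encodings" oddity mentioned above.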

However, when we discussed jml's patch, it looked like it was hard to
get quite the right encoding for the diff headers for implementation
reasons.  (Thinking about it now I'm not quite sure I buy that -
couldn't we just look at the user encoding from within the diff code?)

I would really like to get this feature though.

I know this change has already been rejected, but for a future submission:

@@ -204,7 +204,23 @@
     # confirm/write a message.
     from StringIO import StringIO       # must be unicode-safe
     from bzrlib.status import show_tree_status
-    status_tmp = StringIO()
+    class UnicodeStringIO(StringIO):
+        def __init__(self, buf='', decoding='utf8'):
+            StringIO.__init__(self, buf)
+            self._usio_decoding = decoding
+
+        def write(self, s):
+            if not isinstance(s, unicode):
+                s = s.decode(self._usio_decoding, "replace")
+            StringIO.write(self, s)
+    status_tmp = UnicodeStringIO()
     show_tree_status(working_tree, specific_files=specific_files,
                      to_file=status_tmp)
+    if diff:
+        status_tmp.write(u"\n")
+        from bzrlib.diff import show_diff_trees
+        show_diff_trees(working_tree.basis_tree(), working_tree,
+                    status_tmp,
+                    specific_files)
+

Even if we were adding this, I'd like something like UnicodeStringIO
to be declared separately and to have some tests.
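
Something along these lines is what I have in mind, as a sketch only.  It
is written against Python 3's io.StringIO rather than the Python 2 StringIO
module used in the patch, and DecodingStringIO and the test function names
are invented for the example.

```python
import io


class DecodingStringIO(io.StringIO):
    """Text buffer whose write() also accepts bytes, decoded leniently."""

    def __init__(self, initial_value="", decoding="utf-8"):
        super().__init__(initial_value)
        self._decoding = decoding

    def write(self, s):
        if isinstance(s, bytes):
            # Undecodable bytes become U+FFFD rather than raising.
            s = s.decode(self._decoding, "replace")
        return super().write(s)


def test_text_passes_through():
    buf = DecodingStringIO()
    buf.write("status\n")
    assert buf.getvalue() == "status\n"


def test_bytes_are_decoded():
    buf = DecodingStringIO()
    buf.write(b"caf\xc3\xa9\n")
    assert buf.getvalue() == "caf\u00e9\n"


def test_bad_bytes_are_replaced():
    buf = DecodingStringIO()
    buf.write(b"bad \xff\n")
    assert buf.getvalue() == "bad \ufffd\n"
```

With the class at module level the tests can import it directly, instead
of it being buried inside the function that builds the template.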

Thanks for including a manual update.

I can see two reasonable ways to update this:

1- Redefine the commit message template as a byte string (make a new
method with that meaning), so that we can allow it to include
uninterpreted binary data from the diff.

2- Interpret the diff as being in the user's encoding and read it into
the commit message template with errors=replace.  That has the benefit
that users should actually be able to read all of it without complaints
from their editor, and since the diff is just for display it doesn't
matter so much if some data is lost.  The main problem here is that
people with non-utf8 locales and non-ascii filenames will see those
filenames mangled until we fix diff to use the user's encoding.  But we
should do that anyhow.

So I think I like #2 best.

In fact, even just decoding it as ascii with errors=replace would be
pretty useful.
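
For illustration, option 2 and the ascii fallback amount to roughly the
following (Python 3 sketch; diff_for_template is a hypothetical name, not
an existing bzrlib function):

```python
def diff_for_template(diff_bytes, encoding="ascii"):
    """Decode raw diff output for display in a commit message template.

    Bytes that are invalid in `encoding` become U+FFFD, so this never
    raises on arbitrary file contents.
    """
    return diff_bytes.decode(encoding, "replace")


raw = b"+++ caf\xc3\xa9.txt\n+na\xefve\n"

# User encoding is utf-8: the filename survives, the stray byte is replaced.
as_utf8 = diff_for_template(raw, "utf-8")

# Plain ascii: every non-ascii byte is replaced, but the diff stays readable.
as_ascii = diff_for_template(raw)
```

Either way the result is text the editor can open, at the cost of losing
whatever the replaced bytes were.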

-- 
Martin