[RFC] Does ~/.bzr.log have to be exactly utf-8 encoded?

Tue Aug 22 05:52:20 BST 2006

On 21 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> We have a number of bugs that have arisen over time, because we may pass
> unfiltered arguments to mutter(). Because we declared '~/.bzr.log' as
> utf-8 encoded, we use a codecs.open(..., 'utf8') file.
> 
> What this does, however, is require that all strings passed to
> file.write() be valid utf8 strings. Even more importantly, because
> internally the file is actively encoding every string that is passed in,
> if you pass a utf-8 string, it tries to up-cast it to Unicode, so that
> it can down encode it into utf-8.
> 
> By default, most people's default encoding is 'ascii' not 'utf8'. You
> have to manually customize site.py to change this.
> 
> There are a couple of ways to approach the problem:
> 
> 1) Do what I've been doing. Manually upcast to Unicode, and if anything
> fails, use 'repr()' to turn it into an ascii string. This works, but it
> means that when you pass certain strings, they look a whole lot uglier
> than they have to. Especially certain tracebacks, etc, end up showing up
> as a repr() string, rather than something that might look nicer.
> 
> 2) Change 'bzrlib.trace._trace_file' to be a standard file, and manually
> downcast Unicode to utf-8 before writing it out.

So that cast would be in mutter(), or something called from there?
Along the lines of:

  s = fmt % args
  # Only unicode objects need encoding; byte strings are written as-is.
  if isinstance(s, unicode):
      s = s.encode('utf-8')
  trace_file.write(s)

> (1) is certainly possible, and I've already done the fix to do it, as
> part of fixing some other bugs.
> 
> (2) has the property that the output file isn't guaranteed to be in
> utf-8. *most* of it will be, but if you did:
> mutter('foo: %s', '\xff\xff\xff\xff')
> 
> Then literal '\xff' characters would be output into the log file.

I think that's OK.

> Anyway, the question basically boils down to: is it better to get more
> repr() strings in ~/.bzr.log to ensure that it is truly utf-8 encoded,
> or is it better to be as close to utf-8 as we can, but avoid repr()
> strings, since they are harder to understand?

I think having it be not strictly utf-8 would be OK.  It's not read back
in by bzr, and is not likely to be.  If people do read it in an editor or
in less, then those programs will probably tolerate incorrect encoding.

This also relates to the way debug messages are captured during testing.
In that case we do need to be able to process them in Python, e.g. to
display them when the test fails.
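
A minimal sketch of that processing step, assuming the captured log is
held as raw bytes.  It is written for current Python, where bytes/str
play the roles of the old str/unicode; decode_captured_log and the
BytesIO buffer are hypothetical names, not existing bzrlib code.

```python
import io


def decode_captured_log(raw):
    """Turn captured log bytes into text without ever raising."""
    # 'backslashreplace' keeps undecodable bytes visible as \xNN escapes
    # (supported as a decoding error handler since Python 3.5).
    return raw.decode('utf-8', 'backslashreplace')


# Hypothetical capture buffer standing in for the per-test log.
captured = io.BytesIO()
captured.write('caf\u00e9: '.encode('utf-8'))
captured.write(b'\xff\xff\n')
report = decode_captured_log(captured.getvalue())
```

Valid utf-8 comes back as ordinary text, and the stray bytes are still
recognisable in the failure report instead of aborting it.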

I'd like to be able to run, say, 'bzr -D' and get mutter output on
stderr; it's more straightforward for debugging some things.  But I
suppose that can still use an encoding step which either passes byte
strings through, or perhaps removes unprintable characters.
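
The second option might look roughly like this.  It is only a sketch
for current Python; printable_for_stderr is a hypothetical helper, and
it replaces unprintable characters with '?' rather than dropping them,
so the reader can see that something was there.

```python
def printable_for_stderr(message):
    """Return text that is safe to show on a terminal."""
    # Byte strings are decoded leniently; undecodable bytes become U+FFFD.
    if isinstance(message, bytes):
        message = message.decode('utf-8', 'replace')
    # Keep tabs, replace control characters and other unprintables.
    return ''.join(
        ch if ch.isprintable() or ch == '\t' else '?'
        for ch in message
    )
```

A caller would then do something like
sys.stderr.write(printable_for_stderr(line) + '\n').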

My main priority would be that mutter doesn't fail even if people pass
nominally incorrect parameters (non-decodable byte strings etc.).
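
As a rough illustration of that priority, here is a sketch for current
Python, not the bzrlib implementation.  mutter_sketch is a hypothetical
name; it assumes the trace file is opened in binary mode, and it uses
the 'surrogateescape' error handler so that undecodable bytes survive
formatting and are written out unchanged.

```python
import io


def mutter_sketch(trace_file, fmt, *args):
    """Write one log line to a binary file without raising on odd input."""
    def as_text(value):
        # Undecodable bytes become lone surrogates that encode back to
        # the same bytes later on.
        if isinstance(value, bytes):
            return value.decode('utf-8', 'surrogateescape')
        return value

    try:
        text = as_text(fmt)
        if args:
            text = text % tuple(as_text(a) for a in args)
    except (TypeError, ValueError):
        # Bad format string or arguments: fall back to repr().
        text = 'mutter failed to format %r %r' % (fmt, args)

    try:
        data = text.encode('utf-8', 'surrogateescape')
    except UnicodeEncodeError:
        # Surrogates that did not come from surrogateescape decoding.
        data = text.encode('utf-8', 'backslashreplace')
    trace_file.write(data + b'\n')


log = io.BytesIO()
mutter_sketch(log, 'foo: %s', b'\xff\xff\xff\xff')
mutter_sketch(log, 'bad %s %s', 'only-one')
```

The first call writes the literal \xff bytes to the log, as in option
(2); the second falls back to a repr() line rather than raising.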

-- 
Martin