story of a failure

Mon Apr 29 13:48:43 UTC 2013

On 29 April 2013 11:08, Gavin Panella <gavin.panella at canonical.com> wrote:
> On 29 April 2013 10:57, roger peppe <roger.peppe at canonical.com> wrote:
>> It would be possible to add enough logging statements that
>> we could work out exactly what was going on at the time
>> of a failure, but we would end up producing enormous log files.
>> They are big enough as it is - when Dave was doing some scale
>> testing recently, 8 hours of running juju (and not doing much
>> except starting lots of machines) produced a ~50MB log
>> file. If we produce too much logging data, people are going
>> to turn off logging and then we won't have anything to go on
>> when things do go wrong.
>
> Juju could have a in-memory ring buffer to capture log messages at all
> severities. A snapshot would be taken when there's an error and dumped
> to a file, somewhat like an OOPS report. That way there's context for
> developers, and error messages can remain pithy, more suitable for end
> users.

That's an interesting idea, and one I've given some consideration to.

It does have some down sides though:

- If we save the snapshot to a file, that bypasses our usual
logging mechanisms, so we won't see messages with debug-log,
for example.

- if the app crashes half way through processing a request,
you lose all the ring-buffered messages (unless the ring buffer
is persistent, I suppose, but that seems like potentially considerable
extra complexity in an area where ultra-simplicity really helps)

- you'd probably want to buffer *all* messages like this, otherwise
you get out of order log messages which seems like it would
be a pain. If you buffer all messages, you'd need to decide when to flush
the log, which in a situation with hundreds or thousands of
outstanding requests, is not going to be easy.

- errors might not be that infrequent. If you're not careful, you
may end up saving just as much data as without the ring buffer.

- it seems arbitrary - how many log messages do you save
in the ring buffer? In my experience, some failures can take a
long time to happen - there may be 5 minutes or 30 minutes or
a day between the start of the request and its final result, and
in that time the ring buffer may well have wrapped many times.

As for generating pithy error messages suitable for end users,
I'm not sure that my suggested approach is incompatible with
that. You can truncate my long error message above at any
":" and it will still be useful and make sense. We could do that
automatically (e.g. show a maximum of 3 levels) if we thought
that made sense.

One way of looking at error values in Go is that any function
can return  two kinds of error: those that are documented as being
returned by the function (whether they are particular types or specific
error values) and all others, let's call them "internal errors".

I believe it's wrong to hinge program logic on the latter kind of error,
(that would be relying on fragile and undocumented internal
implementation details) so the best we can do is to produce
a decent error report in such a case - perhaps the right place
to do that is any time we are about to throw away such an error.

In the specific case of the API, we could easily have an error code
InternalErrorCode, which the GUI could see and hide the details
of the rest of the error unless specifically prompted by the user to show them.

  cheers,
    rog.