story of a failure

Mon Apr 29 09:57:25 UTC 2013

On 27 April 2013 17:09, Gustavo Niemeyer <gustavo at niemeyer.net> wrote:
> On Sat, Apr 27, 2013 at 3:21 AM, David Cheney
> <david.cheney at canonical.com> wrote:
>> The error messages are for us, the developers, to diagnose our customers
>> problems. If the customers find the error messages tiresome, it is our
>> job to fix the errors, not make them more aesthetically pleasing.
>
> It's a noble goal, but it's still missing the real point. Your log
> messages will never get closer to a good log, no matter how much you
> love your error tailoring craft.
>
> What was being done before the failure? What were the actual arguments
> for the query performed? What was the timing between the events?

I think you're right about log messages. We need to give them
some love, and think hard about what we are logging and
how we can trace requests through the system. But I think we should
*also* improve our error messages.

I think that log messages and error messages are largely
orthogonal. Log messages can give a good idea of the *flow*
of requests through the system; error messages can give a
single deep probe into the details of what we were trying to
do at the moment of the error.

It would be possible to add enough logging statements that
we could work out exactly what was going on at the time
of a failure, but we would end up producing enormous log files.
They are big enough as it is: when Dave was doing some scale
testing recently, 8 hours of running juju (doing little
except starting lots of machines) produced a ~50MB log
file. If we produce too much logging data, people are going
to turn off logging, and then we won't have anything to go on
when things do go wrong.

Adding context to errors seems like a Good Thing to me - we
get a snapshot of what was going on at the time of the failure,
and we have an opportunity at a higher level to associate
that with other entries in the log.

> An EOF from the database is just that, though. Your request failed
> because the connection was shut on the driver. Saying
>
>     "mongo query failed: cannot acquire socket: cannot log in: error
>     authorizing user "admin": request failed: cannot read reply: EOF"
>
> Is just as good as saying:
>
>     "mongo query failed: EOF"
>
> And I happen to have a good background on debugging database driver problems.

What if the error had been some corrupt data and instead of
a dropped connection, we had some bson error indicating that?
It might very well be relevant then whether the error was
in response to an initial login or a long-running connection.

We're not just talking about database driver problems here - that
just happens to have made a nicely illustrative example.