Proposed new dependency: github.com/juju/errors (and github.com/juju/errgo)

Fri May 30 17:56:34 UTC 2014

On 30 May 2014 17:24, Jeroen Vermeulen <jeroen.vermeulen at canonical.com> wrote:
> On 2014-05-30 15:08, roger peppe wrote:
>
>>> Two: a caller can deal better with some errors, given more detailed
>>> information.  You can help by attaching more information to the error
>>> (tags, taxonomy, properties) but only on a best-effort basis.  You accept
>>> that you don't know exactly which errors can come out of the code further
>>> down.
>>
>>
>> This, on the other hand, is interesting, and something I have not
>> really considered. Thanks for bringing it up. I haven't written code like
>> this, and I'm not aware of any in the juju-core code base.
>>
>> Have you got any real-world examples you could point to?
>
>
> Usually it's mundane stuff.  I'm sure you're familiar with this and I just
> described it in a way that made it harder to recognise!
>
> Think of things like:
>
>  * A broken-connection error can emerge from any of a number of places below
> this function call, but you've got a good way of handling them right after
> you come out of the call.  If you can't recognise the error for whatever
> reason, oh well, the user experience degrades a bit but you handle it as a
> generic program failure.

This example in particular resonates for me, because we do have exactly this
problem in juju. We *could* inspect the error, but there are actually
a reasonably large number of possible errors that could result from
a broken network connection. In our case, we really do need
to know when the connection is broken, so we explicitly test
that, by pinging the other end when we get an error.

The possibility of checking the error certainly occurred to me, but
it seems fragile when a more direct approach exists. For example,
a different kind of error could be returned without reference to the
original (some of the code I linked to earlier does this).
Also, the error might not refer to the connection we think it does.
We would have to assume that the call we're making interacts with
only one network connection, and hence that the error must be about
that connection; but perhaps the call is talking to some other service,
and we misdiagnose the problem as a result.

>  * An IMAP client library annotates some kinds of errors with details about
> the command that failed.  It can't decide for you that you'll want those
> details in the error message, but you may still want to log them elsewhere.
> You don't need that on a DNS lookup error, for example, so for that type of
> error you don't much care if you log the details or not.  It won't bother
> you if the library returns a different kind of error than before in some
> situations.
>
>  * In some error situations a failed database transaction can be retried,
> and may well succeed next time.  This can save you a transient failure.  Of
> course it can be difficult to tell whether it's safe to do so.  Some types
> of database errors can say "retrying is safe as far as I'm concerned," but
> you default to failure.

In both of the above cases, I think the code would be much more
understandable and maintainable if the error kinds were documented and
explicitly maintained by the libraries in question. Otherwise there's
a big risk of accumulating crufty code that never gets triggered
but isn't obviously cruft.

>  * You've been writing a file that could get quite large.  If you fail
> before you're done, you want to keep the data you have.  But if the error
> was "disk full," you prefer to clean up the output as a kindness to the
> user.

This, on the other hand, sounds like you're in close contact with the
actual file-writing code, so it should be relatively easy to check for
syscall.ENOSPC (now *that*'s code that's hard to test decently :-])

>  * My systems management library calls a library called CoolApt to run the
> equivalent of apt-get update.  Your application uses my library.  It knows
> about the CoolApt "can't acquire packages lock" error and does a bit of
> investigating to provide a better explanation to the user.  When version 2.0
> of my library replaces CoolApt with HotApt, you obviously want to upgrade.
> The error is different, so this feature is temporarily broken — but you may
> find that acceptable.

This is another case where we really want the error types documented and
maintained properly; otherwise the feature breaks and nobody notices
until someone stumbles across the issue. In this particular case the
degradation is graceful, but in many similar situations it will not be.
I'd hope to see a test for this feature, which could be disabled
temporarily after we start using the new package.

> These are all "quality of service" distinctions.  The effort of a more
> specific error is appreciated, but you can accept a certain amount of
> change.

In all of these cases I would nonetheless prefer to see the error types
mentioned explicitly in the code. The first example is more arguable
(and luckily there's a workaround for it), but without the workaround
I think there are definite grounds for defining a ConnectionFatalError,
and documenting and annotating the places that can return it, rather
than just inspecting the returned error and deducing cause from effect
without good reason.

  cheers,
    rog.