Detecting and handling service failures

Fri Aug 24 18:25:58 UTC 2012

On Fri, Aug 24, 2012 at 2:08 PM, David Tong <scarabus at gmail.com> wrote:
> I am familiar with SMF on Solaris. In particular, when a service cannot
> be started by SMF it is marked as being in maintenance state. I'm trying
> to use upstart to detect and report on similar conditions.
>
> My understanding of the way that Upstart works is that if a service
> fails then an event is emitted indicating the failure and the service is
> stopped. If you don't catch the event then you don't know it's failed.
> If a user queries the status of a service they only see that it is
> stopped; they don't see the reason. Am I right in thinking that once a
> service is stopped the only way to determine the cause is to view the
> system logs?
>
> Now it's easy to configure upstart to run a job when another process fails:
>    start on stopped tongo RESULT=failed
>
> But as far as I can work out you would need to explicitly enumerate all
> the jobs that you wanted to monitor - or is there a wildcard option?
>    start on stopped *ANY* RESULT=failed

I believe that simply omitting the job name acts as a wildcard. (I've not
tested this, but it ought to work if I understand correctly.) So your
stanza would be:
   start on stopped RESULT=failed
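
For reference, here is a minimal, untested sketch of a complete job file
built around that stanza. The file name failure-reporter.conf, the logger
tag and the message format are placeholders of mine. JOB, INSTANCE,
PROCESS, EXIT_STATUS and EXIT_SIGNAL are the variables that
upstart-events(7) lists for the stopped event.

```
# /etc/init/failure-reporter.conf  (hypothetical file name)
description "Report any job that stops with RESULT=failed"

# No job name is given, so this should match the stopped event of any
# job (the same untested assumption as above).
start on stopped RESULT=failed

# Run once per matching event and exit, rather than staying up as a service.
task

script
    # EXIT_STATUS or EXIT_SIGNAL is only set when a process actually
    # exited or was killed, so one of the two is normally empty.
    logger -t upstart-failure "job=$JOB instance=$INSTANCE process=$PROCESS status=$EXIT_STATUS signal=$EXIT_SIGNAL"
end script
```

If this works as I expect, each failure ends up in syslog under the
upstart-failure tag. One caveat: a plain task only runs one copy at a
time, so a second failure that arrives while it is still logging the first
could be missed. Adding an "instance $JOB" stanza might be one way around
that, but I haven't tried it.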

> What about the case where a new service is added? Obviously I also want to
> be notified if that fails.

That would be caught by the previous stanza, assuming it works.

> Specific RTFM pointers would be welcomed.

The bible of Upstart is: http://upstart.ubuntu.com/cookbook/
I don't think it answers this specific question, though.

Cheers,
Evan