"Watchdog" job not detecting respawns
Dimitri John Ledkov
xnox at ubuntu.com
Wed Jan 27 03:01:02 UTC 2016
Hello,
Thank you for emailing the upstart devel mailing list.
On 27 January 2016 at 02:00, David Ammouial
<David.Ammouial at microsoft.com> wrote:
> Hello,
>
> I'm trying to follow the guide about how to detect a job respawning:
> http://upstart.ubuntu.com/cookbook/#detect-if-a-job-stopped-before-reaching-its-respawn-limit
>
> First step is this draft of a "watchdog" job:
>
> $ cat /etc/init/alert-on-fail.conf
> start on stopped PROCESS!=respawn
> script
> echo "job '$JOB' just failed" >> /tmp/failures.log
> env >> /tmp/failures.log
> end script
>
> Then, I try to kill the target job using its PID, in the two following
> scenarios:
> 1. Target job doesn't contain "respawn"
> 2. Target job does contain "respawn"
>
Instead of using alert-on-fail.conf jobs, I typically use
upstart-monitor to see all the events happening live. It's a stand
alone package / script that can connect to either system or session
upstart and print all the events as they are happening.
I've created foo.conf with just "exec sleep 10000" in it. And e.g.
killed it first with sigkill, and then again with sigfpe (which
triggered apport crash notification).
1 2016-01-27 02:37:39.974577 starting JOB='foo' INSTANCE=''
2 2016-01-27 02:37:39.976374 started JOB='foo' INSTANCE=''
3 2016-01-27 02:37:47.407509 stopping JOB='foo' INSTANCE=''
RESULT='failed' PROCESS='main' EXIT_SIGNAL='KILL'
4 2016-01-27 02:37:47.408379 stopped JOB='foo' INSTANCE=''
RESULT='failed' PROCESS='main' EXIT_SIGNAL='KILL'
5 2016-01-27 02:38:13.786496 starting JOB='foo' INSTANCE=''
6 2016-01-27 02:38:13.792820 started JOB='foo' INSTANCE=''
9 2016-01-27 02:38:49.368112 stopping JOB='foo' INSTANCE=''
RESULT='failed' PROCESS='main' EXIT_SIGNAL='FPE'
10 2016-01-27 02:38:49.368491 stopped JOB='foo' INSTANCE=''
RESULT='failed' PROCESS='main' EXIT_SIGNAL='FPE'
11 2016-01-27 02:38:49.400805 starting JOB='update-notifier-crash'
INSTANCE='/var/crash/_bin_sleep.1000.crash'
12 2016-01-27 02:38:49.406675 file FILE='/var/crash/*.crash'
EVENT='create' MATCH='/var/crash/_bin_sleep.1000.crash'
13 2016-01-27 02:38:49.407015 started JOB='update-notifier-crash'
INSTANCE='/var/crash/_bin_sleep.1000.crash'
14 2016-01-27 02:38:54.092553 stopping JOB='update-notifier-crash'
INSTANCE='/var/crash/_bin_sleep.1000.crash' RESULT='ok'
15 2016-01-27 02:38:54.093028 stopped JOB='update-notifier-crash'
INSTANCE='/var/crash/_bin_sleep.1000.crash' RESULT='ok'
> In scenario 1, I get content in /tmp/failures.log, which is expected:
> job 'foo' just failed
> UPSTART_INSTANCE=
> INSTANCE=
> UPSTART_JOB=alert-on-fail
> TERM=linux
> PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
> UPSTART_EVENTS=stopped
> PWD=/
> RESULT=ok
> JOB=foo
>
> In scenario 2 however, I don't get anything.
>
But what has foo main process do then? and what was expected to
happen? Did it exit with a normal return code?
> I'm noting that the $PROCESS environment variable is not set, and that $RESULT is "ok" although I'm expecting it to be "failed".
>
If a job exits normally, that's normal... E.g.:
$ start foo
foo start/running, process 7912
2 2016-01-27 02:49:24.247490 starting JOB='foo' INSTANCE=''
3 2016-01-27 02:49:24.250735 started JOB='foo' INSTANCE=''
$ restart foo
foo start/running, process 7914
4 2016-01-27 02:49:27.116225 stopping JOB='foo' INSTANCE='' RESULT='ok'
5 2016-01-27 02:49:27.121763 starting JOB='foo' INSTANCE=''
6 2016-01-27 02:49:27.122239 started JOB='foo' INSTANCE=''
$ stop foo
foo stop/waiting
7 2016-01-27 02:49:28.735754 stopping JOB='foo' INSTANCE='' RESULT='ok'
8 2016-01-27 02:49:28.742210 stopped JOB='foo' INSTANCE='' RESULT='ok'
> What am I missing? What is the proper way to detect that a process is being respawned?
>
use upstart-monitor to see _all_ events as they happen. It's a nice debug tool.
PROCESS=respawn is only emitted when upstart decides to, and restarts
a job with respawn stanza and reaches the respawn limits. If a
daemon/job, e.g. catches a signal and re-execs no events will appear
on upstart side as far as I can remember.
In general, to catch every time a job is stopping (dying voluntarily
or otherwise) -> start on stopping; to catch every time a job is
starting -> start on starting.
Here are events for:
respawn
exec false
(deliberately each exec will fail, respawned until giving up)
2 2016-01-27 02:55:19.004438 starting JOB='foo' INSTANCE=''
3 2016-01-27 02:55:19.007424 started JOB='foo' INSTANCE=''
4 2016-01-27 02:55:19.007694 stopping JOB='foo' INSTANCE='' RESULT='ok'
5 2016-01-27 02:55:19.008671 starting JOB='foo' INSTANCE=''
6 2016-01-27 02:55:19.009280 started JOB='foo' INSTANCE=''
7 2016-01-27 02:55:19.009887 stopping JOB='foo' INSTANCE='' RESULT='ok'
8 2016-01-27 02:55:19.010500 starting JOB='foo' INSTANCE=''
9 2016-01-27 02:55:19.011111 started JOB='foo' INSTANCE=''
10 2016-01-27 02:55:19.011328 stopping JOB='foo' INSTANCE='' RESULT='ok'
11 2016-01-27 02:55:19.016796 starting JOB='foo' INSTANCE=''
12 2016-01-27 02:55:19.017176 started JOB='foo' INSTANCE=''
13 2016-01-27 02:55:19.017434 stopping JOB='foo' INSTANCE='' RESULT='ok'
14 2016-01-27 02:55:19.017664 starting JOB='foo' INSTANCE=''
15 2016-01-27 02:55:19.017851 started JOB='foo' INSTANCE=''
16 2016-01-27 02:55:19.018032 stopping JOB='foo' INSTANCE='' RESULT='ok'
17 2016-01-27 02:55:19.018254 starting JOB='foo' INSTANCE=''
18 2016-01-27 02:55:19.018470 started JOB='foo' INSTANCE=''
19 2016-01-27 02:55:19.018643 stopping JOB='foo' INSTANCE='' RESULT='ok'
20 2016-01-27 02:55:19.018834 starting JOB='foo' INSTANCE=''
21 2016-01-27 02:55:19.022553 started JOB='foo' INSTANCE=''
22 2016-01-27 02:55:19.022828 stopping JOB='foo' INSTANCE='' RESULT='ok'
23 2016-01-27 02:55:19.023021 starting JOB='foo' INSTANCE=''
24 2016-01-27 02:55:19.023194 started JOB='foo' INSTANCE=''
25 2016-01-27 02:55:19.024986 stopping JOB='foo' INSTANCE='' RESULT='ok'
26 2016-01-27 02:55:19.025237 starting JOB='foo' INSTANCE=''
27 2016-01-27 02:55:19.025416 started JOB='foo' INSTANCE=''
28 2016-01-27 02:55:19.025601 stopping JOB='foo' INSTANCE='' RESULT='ok'
29 2016-01-27 02:55:19.027167 starting JOB='foo' INSTANCE=''
30 2016-01-27 02:55:19.027403 started JOB='foo' INSTANCE=''
31 2016-01-27 02:55:19.027581 stopping JOB='foo' INSTANCE='' RESULT='ok'
32 2016-01-27 02:55:19.028896 starting JOB='foo' INSTANCE=''
33 2016-01-27 02:55:19.029242 started JOB='foo' INSTANCE=''
34 2016-01-27 02:55:19.039131 stopping JOB='foo' INSTANCE=''
RESULT='failed' PROCESS='respawn'
35 2016-01-27 02:55:19.039371 stopped JOB='foo' INSTANCE=''
RESULT='failed' PROCESS='respawn'
10 attempts, until a hard failure.
Hopefully above upstart-monitor logs are somewhat helpful.
--
Regards,
Dimitri.
-------------- next part --------------
1 2016-01-27 02:37:39.974577 starting JOB='foo' INSTANCE=''
2 2016-01-27 02:37:39.976374 started JOB='foo' INSTANCE=''
3 2016-01-27 02:37:47.407509 stopping JOB='foo' INSTANCE='' RESULT='failed' PROCESS='main' EXIT_SIGNAL='KILL'
4 2016-01-27 02:37:47.408379 stopped JOB='foo' INSTANCE='' RESULT='failed' PROCESS='main' EXIT_SIGNAL='KILL'
5 2016-01-27 02:38:13.786496 starting JOB='foo' INSTANCE=''
6 2016-01-27 02:38:13.792820 started JOB='foo' INSTANCE=''
9 2016-01-27 02:38:49.368112 stopping JOB='foo' INSTANCE='' RESULT='failed' PROCESS='main' EXIT_SIGNAL='FPE'
10 2016-01-27 02:38:49.368491 stopped JOB='foo' INSTANCE='' RESULT='failed' PROCESS='main' EXIT_SIGNAL='FPE'
11 2016-01-27 02:38:49.400805 starting JOB='update-notifier-crash' INSTANCE='/var/crash/_bin_sleep.1000.crash'
12 2016-01-27 02:38:49.406675 file FILE='/var/crash/*.crash' EVENT='create' MATCH='/var/crash/_bin_sleep.1000.crash'
13 2016-01-27 02:38:49.407015 started JOB='update-notifier-crash' INSTANCE='/var/crash/_bin_sleep.1000.crash'
14 2016-01-27 02:38:54.092553 stopping JOB='update-notifier-crash' INSTANCE='/var/crash/_bin_sleep.1000.crash' RESULT='ok'
15 2016-01-27 02:38:54.093028 stopped JOB='update-notifier-crash' INSTANCE='/var/crash/_bin_sleep.1000.crash' RESULT='ok'
More information about the upstart-devel
mailing list