RFC: Best practices for upstart jobs in Ubuntu

Fri Jan 14 17:17:58 UTC 2011

This is a list of upstart best practices that I've put together
recently. I'm looking for some feedback and especially the answer to the
question of "where should this go?". My thinking is it should end up on
a site like upstart.ubuntu.com but maybe it could be on upstart.at? Also
it might fit well with the upstart-intro manpage that James Hunt sent
earlier this week.

The formatting is for the MoinMoin wiki, since another option is to put
it on help.ubuntu.com or wiki.ubuntu.com.

Also the 'post 11.04' bits are assuming that my proposed abstract job
'network-services' is accepted into Natty. If you have comments on that,
please see

https://launchpad.net/bugs/701576

== By stanza ==
=== start on ===
==== Traditional Network Services ====
start on (local-filesystems and net-device-up IFACE!=lo)

post 11.04:

start on started network-services

==== Dependent on another service ====
start on started other-service
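
For instance, a hypothetical worker that needs the local MySQL server
could pair this with the matching 'stop on' stanza from the next
section, so it only runs while mysql does. The job name 'myapp-worker'
and its binary path are made up for illustration, and the 'mysql' job
name assumes MySQL is itself managed by an upstart job of that name.

```upstart
# myapp-worker (hypothetical example job)

# Start only once the mysql job has reached its running state...
start on started mysql
# ...and shut down before mysql itself is stopped.
stop on stopping mysql

respawn

exec /usr/local/bin/myapp-worker
```
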
==== Must precede another service ====
start on starting other-service

Example: your web app needs memcached to be started before apache

start on starting apache2
stop on stopped apache2
respawn

exec /usr/sbin/memcached

=== stop on ===
==== Traditional Network Services ====

stop on runlevel [016]

post 11.04:

stop on stopping network-services
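
Putting the pre-11.04 'start on' and 'stop on' stanzas together, a
minimal job for a traditional network daemon might look like the sketch
below. The job name 'mydaemon', its path, and its --foreground option are
hypothetical placeholders for your daemon's real binary and its "don't
detach" flag, if it has one.

```upstart
# mydaemon (hypothetical example job)

description "example network service"

# Pre-11.04 form; after 11.04: start on started network-services
start on (local-filesystems and net-device-up IFACE!=lo)
# Pre-11.04 form; after 11.04: stop on stopping network-services
stop on runlevel [016]

respawn

# Stay in the foreground so upstart tracks this pid without 'expect'.
exec /usr/sbin/mydaemon --foreground
```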

==== Dependent on another service ====
stop on stopping other-service

Note that this also stops the job when other-service is restarted
manually, so pair it with a matching start condition to bring it back
up afterwards:

start on started other-service

==== Enhances another service ====
stop on stopped other-service

=== respawn ===

This one may be a bit confusing for some. Without 'respawn', upstart
does not restart a job whose main process exits: whatever the exit
status, the job simply returns to the 'stop/waiting' state until
something starts it again. That is exactly what you want for a 'task',
which is expected to exit, but for a long-running service it means a
crash leaves it down.

With the 'respawn' stanza, the job will be started again if it exits
"abnormally". This is defined as any exit status not equal to 0, or
being killed by a signal other than as part of a normal stop. One can
redefine this criterion with the 'normal exit' stanza, which also lets
you list signals that are considered acceptable ways for the service to
end without being respawned. Respawning is throttled: by default, a job
that respawns more than 10 times within 5 seconds is stopped instead of
being respawned again (see 'respawn limit').

It is recommended that most server-type services use the 'respawn'
stanza. If it is left off, a service that crashes or is killed will stay
down until someone starts it again by hand.
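
A sketch of a service using 'respawn' together with the two related
stanzas. The 'mydaemon' job and its --foreground option are
hypothetical; the 'respawn limit' values shown are upstart's defaults,
and the 'normal exit' list is only an illustration of the syntax.

```upstart
# mydaemon (hypothetical example job)

respawn
# Give up if the job respawns more than 10 times within 5 seconds.
respawn limit 10 5

# Treat exit status 0, or death by SIGTERM, as a normal exit that
# should not trigger a respawn.
normal exit 0 TERM

exec /usr/sbin/mydaemon --foreground
```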

=== task ===

task is for something that you just want to run when a certain event
happens.

EXAMPLE:

# pre-warm-memcache

start on started memcached

task

exec /path/to/pre-warm-memcached

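The key point here is that a task is expected to run to completion:
upstart waits for the exec portion to *exit*, rather than supervising it
as a long-running service. The task's own 'started' event is emitted as
soon as it has been launched, however, so a job that must wait until the
task has *finished* successfully should start on its 'stopped' event
with RESULT=ok.

So you can have another job that starts your background queue worker
once the local memcached is pre-warmed:

# queue-worker

start on stopped pre-warm-memcache RESULT=ok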
stop on stopping memcached

respawn

exec /usr/local/bin/queue-worker

=== expect ===

Upstart keeps track of the process ID that it thinks belongs to a job
(or to each instance of it, for jobs that use 'instance').

If you do not give an 'expect' line, upstart will track the life cycle
of the exact pid that it executes. However, many Unix services will
"daemonize": they fork a new process that detaches from the terminal and
other bits of the parent's environment, and the original process exits.

In this case, upstart must have a way to track it, so you can use
'expect fork', or 'expect daemon'. These will expect the process
executed to fork once, or twice, respectively, and then track that pid.

This can be tricky, so if your daemon has a "don't daemonize" or "don't
fork" mode, it is much simpler to use that and not have upstart follow
forks at all. The drawback is that upstart will then emit the 'started'
event as soon as it has executed your daemon, which may be before the
service is actually ready (for instance, before a network service is
listening on its port).

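
A sketch contrasting the two approaches, using the same hypothetical
'mydaemon' (the name and its --foreground flag are placeholders, and the
alternative assumes a daemon that detaches by forking twice). Only one
variant would appear in a real job file, so the second is shown
commented out.

```upstart
# mydaemon (hypothetical), preferred: run the daemon in the foreground.
# With no 'expect' line, upstart tracks the pid it executes directly.
respawn
exec /usr/sbin/mydaemon --foreground

# mydaemon (hypothetical), alternative: the daemon detaches by forking
# twice, so 'expect daemon' tells upstart to follow it to the final pid.
# expect daemon
# respawn
# exec /usr/sbin/mydaemon
```
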
=== pre-start ===

Use this stanza to prep the environment for the job. Clearing out
cache/tmp dirs is a good idea, but any heavy logic is discouraged, as
upstart job files should read like configuration files, not so much like
complicated software.

As an example:

pre-start script
  [ -d "/var/cache/squid" ] || squid -z
end script

=== post-start ===

Use this stanza when the job needs to be waited for before it is
considered "started". An example is mysql: after it is executed, it may
need to perform recovery operations before accepting network traffic.
Rather than let dependent services start too early, you can have a
post-start like this:

post-start script
  while ! mysqladmin -h localhost ping ; do sleep 1 ; done
end script
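
The loop above keeps polling for as long as mysqld fails to answer,
which can leave the job stuck in post-start indefinitely. A bounded
variant is sketched below; the 30-attempt limit is an arbitrary choice,
and giving up still lets the job be marked started, so dependent jobs
may then come up before mysqld is accepting connections.

```upstart
post-start script
  # Poll once a second for up to 30 attempts (an arbitrary limit).
  for i in $(seq 30) ; do
    mysqladmin --silent ping && exit 0
    sleep 1
  done
  # Stop waiting; the job is still considered started at this point.
  exit 0
end script
```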

=== pre-stop ===

Stopping a job involves sending SIGTERM to it. If anything needs to be
done before SIGTERM is sent, do it here. Arguably, services should
handle SIGTERM gracefully, so this shouldn't be necessary. However, if
the service takes longer than 'kill timeout' seconds (5 by default) to
exit after SIGTERM, it will be sent SIGKILL. So if there is anything
critical, like a flush to disk, and raising 'kill timeout' is not an
option, pre-stop is a good place to do it.
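
Both options, sketched for a hypothetical 'mydaemon' job:
'mydaemonctl flush' stands in for whatever command your daemon offers
to flush its state, and the 30-second timeout is an arbitrary example
value.

```upstart
# mydaemon (hypothetical example job)

# Allow up to 30 seconds between SIGTERM and SIGKILL instead of 5.
kill timeout 30

# Ask the daemon to flush its state before upstart sends SIGTERM.
pre-stop exec /usr/sbin/mydaemonctl flush

respawn
exec /usr/sbin/mydaemon --foreground
```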

=== post-stop ===

There are times when the cleanup done in pre-start is not enough.
Ideally, cleanup should be done both in pre-start and in post-stop, so
that the service starts with a consistent environment and does not
leave behind any mess.
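
A sketch of cleaning up in both places, for a hypothetical 'mydaemon'
job whose runtime files live in the equally hypothetical
/var/run/mydaemon directory.

```upstart
# mydaemon (hypothetical example job)

pre-start script
  # Start from a clean, known state.
  rm -rf /var/run/mydaemon
  mkdir -p /var/run/mydaemon
end script

post-stop script
  # Don't leave stale sockets or pid files behind.
  rm -rf /var/run/mydaemon
end script

respawn
exec /usr/sbin/mydaemon --foreground
```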

=== exec / script ===

If possible, you'll want to run your daemon with a simple exec line,
something like this:

exec /usr/bin/mysqld

If you need to do some scripting before starting the daemon, script
works fine here. Here is one example of using a script stanza that may
be non-obvious:

# statd - NSM status monitor

description	"NSM status monitor"
author		"Steve Langasek <steve.langasek at canonical.com>"

start on (started portmap or mounting TYPE=nfs)
stop on stopping portmap

expect fork
respawn

env DEFAULTFILE=/etc/default/nfs-common

pre-start script
	if [ -f "$DEFAULTFILE" ]; then
	    . "$DEFAULTFILE"
	fi

	[ "x$NEED_STATD" != xno ] || { stop; exit 0; }

	start portmap || true
	status portmap | grep -q start/running
	exec sm-notify
end script

script
	if [ -f "$DEFAULTFILE" ]; then
	    . "$DEFAULTFILE"
	fi

	if [ "x$NEED_STATD" != xno ]; then
		exec rpc.statd -L $STATDOPTS
	fi
end script

Because this job is marked "respawn", an exit status of 0 is "ok" and
will not force a respawn (only exiting with a non-zero status or being
killed by an unexpected signal causes a respawn). This script stanza is
therefore used to start the optional daemon rpc.statd based on the
defaults file. If NEED_STATD=no is set in /etc/default/nfs-common, the
job runs this snippet of script, which exits with a return code of 0.
Upstart will not respawn it, but will simply see that it has stopped on
its own and return it to the 'stopped' state. If, however, rpc.statd had
been run, the job would stay in the 'start/running' state and be tracked
normally.

=== instance ===

Sometimes you want to run the same job, but with different arguments.
The variable that defines the unique instance of this job is defined
with 'instance'.

Example:

Let's say that once memcached is up and running, we want to start a queue
worker for each directory in /var/lib/queues:

# queue-workers

start on started memcached

task

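script
  for dir in /var/lib/queues/* ; do
    [ -d "$dir" ] || continue
    # '|| true' keeps 'sh -e' from aborting if this instance already runs
    start queue-worker QUEUE=$(basename "$dir") || true
  done
end script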

And the queue-worker job itself:

# queue-worker

stop on stopping memcached

respawn

instance $QUEUE

exec /usr/local/bin/queue-worker $QUEUE

In this way, upstart will keep them all running with the specified
arguments, and stop them if memcached is ever stopped.