[RFC] Disabling jobs in Upstart

Mon Jun 20 03:54:46 UTC 2011

James, great write-up. The automated configuration management systems
that are available today are quite incapable of handling Upstart,
so I think there's a huge need for some simpler automation.

My reply is inline, and, I'm afraid, a bit rambling as well, as it was
written over a couple of days...

Excerpts from James Hunt's message of Fri Jun 17 12:42:17 -0700 2011:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi All,
> 
> = Caveat =
> 
> This is very much a brain dump and doesn't have all the answers - please
> comment and fill in the blanks when you spot them! :-)
> 
> = Introduction =
> 
> We are looking to provide the ability to fully disable a job.
> 
> = Rationale =
> 
> Lots of users are familiar with the old SysV way of handling jobs and
> are looking for a chkconfig-like tool to ease the transition to Upstart.
> 
> The "manual" stanza coupled with the Override facility does already
> provide this facility, but have the following shortcomings.
> 
> == Shortcomings of Override Files ==
> 
>  * There is no programmatic Upstart interface: it requires a tool/user to
>    manually create a ".override" file contaning the "manual" stanza (or
>    simply appending "manual" to the ".conf" file).
> 
>  * It is too generic a facility / not "fail-safe"
> 
>    Any Admin/tool/pkg can manipulate ".override" files. If an Admin
>    disables a job using a ".override" file, they might find that it has
>    later been changed by another tool that rewrote the override. This is
>    undesirable since the job may no longer be disabled.

Conflicting changes to a single configuration, whether in one file or a
group of files, will always be a problem. In SysV, you can have one tool
that disables a service, and another that just moves it from starting at
position 20 to position 60. Then insserv comes along and reorders it all
for dependencies that the tool didn't account for.

What's important is that the system provides a gracefully degrading
mechanism that abstracts the disabling/restricting/limiting behavior at
the points that make sense. With boot order, there's really only one
set of concerns to address: the root user and how they want the system
to boot. So I'm not sure where the override file fails to support this.

The tools that modify the override file *must* be able to introspect
the situation, or be reasonably sure that they can take blind action
that will succeed. Right now, echo manual >> /etc/init/job.override
gives one the blind-action assurance. I think we identified the need
for initctl to be able to tell us whether or not a job is manual,
which would give a standard way to achieve introspection.
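
To make the introspection point concrete, here is a rough Python sketch
of the check such a tool would need. It is only an illustration:
is_manual is a made-up name, it works on the text of the .conf and
.override files rather than asking Upstart, and it assumes that "manual"
cancels any start on seen before it while a later start on re-enables
the job. It ignores script blocks and multi-line conditions.

```python
def is_manual(conf_text: str, override_text: str = "") -> bool:
    """Return True if 'manual' is the last word on how the job starts.

    Hypothetical helper: assumes the override file is applied after the
    .conf file and that stanzas are one per line.
    """
    manual = False
    for line in (conf_text + "\n" + override_text).splitlines():
        # Drop trailing comments, then split into words.
        words = line.split("#", 1)[0].split()
        if not words:
            continue
        if words[0] == "manual":
            manual = True
        elif words[:2] == ["start", "on"]:
            # A start on that follows 'manual' re-enables automatic starts.
            manual = False
    return manual


if __name__ == "__main__":
    print(is_manual("start on runlevel [2345]\n", "manual\n"))
```

A real implementation would want to get this answer from Upstart's own
parser, which is why having initctl report it seems like the right fix.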

> 
>  * Not obvious how to determine if a job *is* enabled or disabled.
> 
>    It is possible though. See:
> 
>    http://upstart.ubuntu.com/cookbook/#determine-if-a-job-is-disabled

I think there's a need for a stronger keyword, 'disabled', which totally
disables the job even if manually started with 'start foo'. Given that,
the status tool could show if a job is manual and/or disabled. If the
desire is to disable only some of its start/stop conditions, well that
can also be done by adding a new start on to the end of the override file.

Basically there is a difference between wanting something to not run ever,
and wanting to change the default way something runs without modifying
the job file.

> 
> = Requirement =
> 
> A "chkconfig"-like tool [1] to allow:
> 
>  * Jobs to be disabled in particular runlevels.

I think this is far less useful on Debian/Ubuntu. Runlevels just aren't
as important there as they are in RH, where they mean a lot more.

I do see some cases where people want to be able to affect one start
condition, but not the other. The precision of Upstart's start on
/ stop on conditions makes it hard to translate the imprecise (but
simple) methods possible w/ runlevels.

> 
>  * The ability to determine if a job is disabled for a particular
>   runlevel.
> 

I think the visualization tool can be used to tell if a job might be
started or stopped given a particular event. Am I over-stating its
capabilities?

>  * The ability to determine if a job *will* run for a particular
>    runlevel (note: this is *NOT* the same as the bullet above!
>    See below...)
> 
> = Ideal =
> 
> The ideal tool would provide the following details:
> 
>  * Job name.
>  * Instance name.
>  * Which runlevels a job is enabled and disabled in.
>    This breaks down into:
>    * Job is enabled for specified runlevel.
>    * Job is explicitly disabled for specified runlevel.
>    * Job is *implicitly* disabled for specified runlevel.
>  * Whether the job ran last time?
>    (would require an event+job log. Can never be 100% reliable of course
>    since config may have changed between boots.)

If I replace 'runlevel' with 'boot phase', and think of it more like
what Scott said ChromeOS does, this makes more sense. I think most
of the time an admin who wants to change the 'runlevel' something is
enabled or disabled in really just wants to move it to start before
everything else, or after everything else. Otherwise they want to express
a non-obvious, non-generic dependency. Either way, both are handled
by following jobs, whether it's a boot phase job or an explicit job. Then
the tool just needs to be good at manipulating start/stop on conditions.

> 
> = Preliminaries =
> 
> == Thoughts and Observations ==
> 
>  * It is actually rather difficult to map the Upstart event model onto such
>    a tool since SysV init doesn't behave like Upstart (further details below).
> 
>  * If a job is explicitly disabled completely, jobs which start on that
>    job will be implicitly disabled. This information needs to be
>    conveyed somehow.
> 

Again, the manual and (if it's agreed upon and comes into existence)
the disabled keywords need to be carried into the visualization
tool. Then whenever you change one of these things, you can very easily
print the diff between the old and new trees and ask the user if they're
ok with that.

Something like

startup
 -starting mountall
  -starting lxc-mountall
  -started lxc-mountall
  -> local-filesystems
   + started dbus
     -starting network-manager
 -starting dbus
  -started dbus
   +local-filesystems
     -starting network-manager

A user or program can then detect which jobs appear or disappear under
each condition and report it. This would be useful in automated
integration testing for Ubuntu as well, since we could very easily
install all co-installable packages and run this, and then raise
warnings when a job wasn't going to start.
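
A minimal sketch of that comparison, assuming the visualization tool can
dump the tree as indented plain text before and after a change. The
function name and the text format are invented for the example; only
Python's standard difflib is used.

```python
import difflib
from typing import List, Tuple


def tree_changes(before: str, after: str) -> Tuple[List[str], List[str]]:
    """Return (removed, added) entries between two event-tree listings."""
    removed, added = [], []
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        # ndiff prefixes every line with a two-character code.
        if line.startswith("- "):
            removed.append(line[2:].strip())
        elif line.startswith("+ "):
            added.append(line[2:].strip())
    return removed, added


if __name__ == "__main__":
    old = "startup\n starting dbus\n  started dbus\n   starting network-manager\n"
    new = "startup\n starting dbus\n  started dbus\n"
    gone, new_jobs = tree_changes(old, new)
    for entry in gone:
        print("warning: no longer reached:", entry)
```

An integration test could fail, or at least warn, whenever the removed
list is not empty.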

>  * If a job has a start on condition as below, what action should we
>    take if the user requests the job be disabled in runlevel 2?:
> 
>      start on foo or runlevel 2
> 
>    Since it is (currently) not possible to know upfront whether "foo" or
>    "runlevel 2" will be satisfied at boot time, it may be reasonable to
>    (by default) disable such a job in runlevel 2 since the "start on"
>    has specified it *might* "start on runlevel 2". We could provide an
>    option to control this subtle behaviour.

We'd have our tool mask out runlevel 2 and do the equivalent of

sudo sh -c 'printf "# added by our tool %s\nstart on foo\n" "$(date)" >> /etc/init/foo.override'

This would disable the runlevel 2 part of the start on condition. Asking
Upstart what its effective start on condition is would then present us
with 'start on foo'.
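
The masking step itself could look roughly like this in Python. It is a
sketch under a big assumption: the start on is a flat list of
alternatives joined by "or", with no parentheses and no "and".
mask_or_term and override_lines are hypothetical names, and falling back
to "manual" when nothing is left is my guess at sensible behaviour.

```python
import re


def mask_or_term(start_on: str, masked: str) -> str:
    """Drop every 'or' alternative equal to `masked` from a flat start on."""
    terms = [t.strip() for t in re.split(r"\s+or\s+", start_on.strip())]
    return " or ".join(t for t in terms if t != masked)


def override_lines(start_on: str, masked: str, stamp: str) -> str:
    """Build the text a tool would append to the job's override file."""
    remaining = mask_or_term(start_on, masked)
    # With nothing left to start on, the job becomes manual.
    stanza = "start on " + remaining if remaining else "manual"
    return "# added by our tool %s\n%s\n" % (stamp, stanza)


if __name__ == "__main__":
    print(override_lines("foo or runlevel 2", "runlevel 2", "(date here)"), end="")
```

Handling "and" and nested parentheses properly needs a real parser for
the condition, which is the argument for exposing Upstart's own.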

> 
>  * If we provide the ability to disable any job, the system could become
>    unbootable very quickly.
> 
> 
> == Constraints ==
> 
>  * Upstart currently has no knowledge of SystemV runlevels: they are
>    supported through events and external applications such as telinit.
> 
>    This premise should not need to be contravened - the internals of
>    Upstart should not need to be imbued with runlevel knowledge. This
>    implies that:
> 
>    1. The facility should work for *any* event (not just runlevels).
> 
>    2. The facility should be driven by an external tool of some kind (in
>       other words either a program or script which calls initctl as
>       appropriate).
> 
>  * Runlevels are implemented with the "runlevel" event which has a
>    primary environment variable "RUNLEVEL" taking a value from 0 to 6.
>    It needs to be possible to disable a job:
> 
>     * entirely (where it has any "start on" condition).
> 
>     * in all runlevels ("[0123456]").
> 
>     * in some runlevels (for example "[345]").
> 
>  * Upstart allows jobs to be started based on arbitrarily complex
>    conditions. Any facility to disable a job should consider these
>    conditions.
> 

+1 for a tool that helps admins use Upstart's event-based model rather
than hiding it from them.

> 
> == Categories of Jobs ==
> 
> There are a number of job categories that we need to consider:
> 
>  1. Jobs that specify a start on which does *NOT* include
>     runlevel.
> 
>     They may start before or after the runlevel event is emitted.
> 
>  2. Jobs that start on the initial event.
> 
>     A small handful of jobs "start on startup". This is a specialisation
>     of (1).
> 
>  3. Jobs that "start on runlevel" (a single event).
> 
>     Such jobs may restrict the start on further by specifying
>     environment variables (RUNLEVEL and PREVLEVEL).
> 
>  4. Jobs that specify a "complex" start on (one using "and" / "or")
>     which includes "runlevel".
> 
> = Terminology =
> 
> * "limit"
> 
>   Since we want to be able to disable Upstart jobs based on some
>   condition, "disable" is rather a crude term. The word "limit" is
>   better since it connotes the more fine-grained approach being
>   proposed. Its antonym being "delimit" (I'd initially thought of
>   "restrict" and "derestrict" but (,de)limit is shorter :-)
> 

I like this term, and I like the idea of it being able to take an optional
set of start on / stop on keywords that simply mask the given criteria
out of the start on or stop on. Given the 'start on foo or runlevel 2'
example above:

limit start on runlevel 2

Achieves the desired effect.

> 
> = Scope =
> 
> Ideally, it would be possible to disable a job *instance*. But that is
> probably going to be an "iteration 2" feature.
> 
> Of the four categories of Jobs outlined above, only category (3) and (4)
> can reasonably be dealt with by this design. Category (1) breaks down
> into jobs that run before the runlevel event is emitted (about 20 on an
> Ubuntu oneiric system currently) and jobs that run after. The former
> have to be excluded but the latter may be able to be considered. It is
> possible that many of those would end up being implicitly disabled if a
> job in category (3) or (4) were disabled anyway [2].
> 
> It isn't reasonable to stop category (2) jobs from running since that
> will almost certainly break your system anyway: mountall won't run for
> starters!
> 
> 
> = High-Level Plan =
> 
> My thoughts at this stage are that we provide 3 new commands (note these
> are not *necessarily* initctl commands):
> 
>   * limit <job> [<expr>]
> 
>     Restrict conditions on which job <job> is started. <expr> is assumed
>     to be a subset of the "start on" condition of <job>, however if it
>     is not, this is not an error (but a warning should probably be
>     issued since the command would have no effect at that point in time.
> 
>     QUESTION: If job <job> has already been limited, what do we do:
> 
>     1. Throw an error.
>     2. Replace the existing limit with the new one.

If it would be a noop, do nothing, exit 0. If it would change the
start/stop criteria, just do it. This way it only happens once. --verbose
shows the "not doing anything" or "setting start/stop on to xxx" message
for those who are confused why it did nothing.

> 
>     QUESTION: How would we handle this scenario?:
> 
>     $ restrict cron runlevel [35]
>     $ restrict cron runlevel RUNLEVEL=4

Assuming cron's original start on was

start on runlevel [2345]

And assuming it knows that 'runlevel RUNLEVEL=4' is equivalent to 'runlevel [4]'..

The first call would mask 35 from any current arguments to runlevel and add

# Added by 'limit cron runlevel [35]' Sun Jun 19 19:15:03 -0700
start on runlevel [24]

To the override file. The second one would calculate that 4 must be
removed and add

# Added by 'limit cron runlevel RUNLEVEL=4' Sun Jun 19 19:17:08 -0700
start on runlevel [2]

The key is being able to ask Upstart what the current effective start
on is, so that the tool acts only on that.
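
Here is the arithmetic for the cron case as a small Python sketch. The
names are invented, it only looks at the first argument of a runlevel
event (so PREVLEVEL and negated sets like [!2] are ignored), and it
takes a bare 'runlevel' to mean runlevels 0 to 6.

```python
ALL_RUNLEVELS = set("0123456")


def runlevels_of(event: str) -> set:
    """Return the runlevels matched by 'runlevel', 'runlevel [35]',
    'runlevel 4' or 'runlevel RUNLEVEL=4'."""
    words = event.split()
    if not words or words[0] != "runlevel":
        raise ValueError("not a runlevel event: %r" % event)
    if len(words) == 1:
        return set(ALL_RUNLEVELS)
    arg = words[1]
    if arg.startswith("RUNLEVEL="):
        arg = arg[len("RUNLEVEL="):]
    # '[2345]' and '2345' both become {'2', '3', '4', '5'}.
    return set(arg.strip("[]"))


def limit_runlevels(effective: str, limit: str) -> str:
    """Mask the limit's runlevels out of the effective start on."""
    remaining = runlevels_of(effective) - runlevels_of(limit)
    if not remaining:
        return "manual"
    return "start on runlevel [%s]" % "".join(sorted(remaining))


if __name__ == "__main__":
    print(limit_runlevels("runlevel [2345]", "runlevel [35]"))
    print(limit_runlevels("runlevel [24]", "runlevel RUNLEVEL=4"))
```

Each call takes the effective condition left by the previous one, which
is why the second limit ends up at [2] rather than [235].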

> 
>     Possible outcomes:
> 
>     1. Cron is restricted in runlevels 3+5.
>     1. Cron is restricted in runlevel 4.
>     1. Cron is restricted in runlevel 3, 4 and 5.
> 
>   * delimit <job>
> 
>     Returns any current limit expression and undoes the effect of
>     "limit".

Simplest implementation has this adding

# Added by 'delimit cron' Sun Jun 19 19:18:23 -0700
start on runlevel [2345]

to the override file. A more complicated one might remove all limits
from the override file, but I think the former is more elegant, and the
parsing of 3 or 4 start on's is pretty close to computationally free so
I don't see any downsides to keeping it simple.
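
To show why stacking entries is harmless, here is a sketch of how a tool
could compute the effective start on from the two files. It assumes, as
this whole scheme does, that the last start on read wins and that the
override is read after the .conf; effective_start_on is a made-up name
and only single-line stanzas are handled.

```python
from typing import Optional


def effective_start_on(conf_text: str, override_text: str = "") -> Optional[str]:
    """Return the last start on condition seen, or None if the job is manual."""
    effective = None
    for line in (conf_text + "\n" + override_text).splitlines():
        words = line.split("#", 1)[0].split()
        if words[:2] == ["start", "on"]:
            # Each new start on replaces whatever came before it.
            effective = " ".join(words[2:])
        elif words == ["manual"]:
            effective = None
    return effective


if __name__ == "__main__":
    conf = "start on runlevel [2345]\n"
    override = (
        "# Added by 'limit cron runlevel [35]'\n"
        "start on runlevel [24]\n"
        "# Added by 'delimit cron'\n"
        "start on runlevel [2345]\n"
    )
    print(effective_start_on(conf, override))
```

The commented history stays in the file for the admin to read, and only
the final line matters to the job.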

If you want to just remove one limit, forget about it applying only to
the limits you've explicitly added. It can delimit *anything*. So

delimit cron start on started mysql

Just adds started mysql as an "OR" condition for cron. This becomes an
elegant and simple-to-understand tool for any automation system to add
safe event conditions. Since AND requires thought before doing, let's
leave that to manual intervention.
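
Adding the OR is nearly a one-liner. In this Python sketch the function
name is hypothetical, and wrapping an existing "and" expression in
parentheses is a precaution I'm assuming is wanted, not something
checked against Upstart's parser.

```python
def add_or_condition(effective: str, extra: str) -> str:
    """Return a start on stanza that also fires on `extra`."""
    if not effective:
        return "start on " + extra
    # Keep an existing 'and' expression grouped before OR-ing onto it.
    if " and " in " %s " % effective:
        effective = "(%s)" % effective
    return "start on %s or %s" % (effective, extra)


if __name__ == "__main__":
    print(add_or_condition("runlevel [2345]", "started mysql"))
```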

> 
>   * show-limit [<job> [<expr>]]
> 
>     Show limits for all jobs or specified job.
> 
>     Command should emit a warning if any limit is found that is not a
>     subset of the "start on" for the job in question (since the limit
>     will have no effect).
> 
>     If no expression is supplied, show "raw" limit. If an expression
>     *is* specified, determine if job would run given that expression.
> 
>     Example: Assume a job specifies "start on runlevel [345]". If a
>     limit of "runlevel RUNLEVEL=4" has been set, we want a higher-level
>     tool to be able to query directly if the job would run in runlevel 4
>     so returning "runlevel [345]" isn't that helpful. What we really
>     want to say is:
> 
>       $ show-limit foo runlevel 4

This falls back on Upstart/initctl, I think. A command that says "show
me the possible event chain(s) that lead to job X starting" would be
highly useful even without limits. If it can put a * next to every
condition that is overridden, that might be helpful.

> 
>     And have the tool display whether for "runlevel 4" job foo would run
>     based on the limit of "runlevel [345]". This could be displayed in
>     parseable format and also maybe returned via the return code.
> 
>     Thought: maybe we could add a "query-limit" command specifically for
>     this and have "show-limit" just return the "raw" limit details?
> 
> 
> = Implementation Details =
> 
> == Limit Condition ==
> 
> To satisfy the chkconfig requirement, we could just allow a single event
> and optional environment to be specified. However, the better solution
> is to allow an arbitrary condition (like "start on" and "stop on"). The
> condition could almost be viewed as a "restrict on" stanza. Only one
> such limit condition may be specified.
> 
> XXX: Note that the condition itself -- for the example of runlevels --
> cover all the runlevels where that job must not run. This is an
> important point: the condition only specifies a single runlevel if that
> job should only be disabled in a single runlevel. The "norm" is
> probablly more likely to be where the condition covers *more than one*
> runlevel. This is perfectly acceptable since "show-limit" allows an
> *actual* runlevel to be specified so a higher-level tool can establish
> if a job would be disabled for a particular runlevel.
> 
> == Matching Limits to Events ==
> 
> If a job condition becomes "true" such that Upstart would normally
> attempt to start the job and if that job has a limit condition which
> "matches" part of the EventOperator tree, Upstart will not run the job.

If we just have limit as a tool that manipulates the override file, this
is no longer part of the implementation, is it?

> 
> === Examples ===
> 
> start on : runlevel [2345]
> runlevel : 2
> limit    : runlevel 2
> outcome  : match - job will be disabled in runlevel 2.
> 
> 
> start on : runlevel [2345]
> runlevel : 2
> limit    : runlevel
> outcome  : match - job will be disabled in runlevel 2.
> 
> 
> start on : runlevel
> runlevel : 2
> limit    : runlevel [2345]
> outcome  : match - job will be disabled in runlevel 2.
> 
> 
> start on : runlevel 2
> runlevel : 2
> limit    : runlevel [2345]
> outcome  : match - job will be disabled.
> 
> 
> start on : runlevel RUNLEVEL=2
> runlevel : 2
> limit    : runlevel [2345]
> outcome  : match - job will be disabled.
> 
> start on : runlevel [2345]
> runlevel : 2
> limit    : runlevel RUNLEVEL=2
> outcome  : match - job will be disabled.
> 
> 
> start on : runlevel RUNLEVEL=2
> runlevel : 2
> limit    : runlevel [2345]
> outcome  : match - job will be disabled.
> 
> 
> start on : runlevel RUNLEVEL=2 PREVLEVEL=S
> runlevel : 2
> limit    : runlevel [2345]
> outcome  : match - job will be disabled.
> 
> 
> start on : runlevel RUNLEVEL=2
> runlevel : 2
> limit    : runlevel [2345] S
> outcome  : no match - job will run.
> 
> start on : runlevel 2
> runlevel : 2
> limit    : runlevel [345]
> outcome  : no match - job will run. warning will be generated since
>            limit cannot match the start on condition.
> 
> 
> start on : foo or runlevel 2
> runlevel : 2 (foo has not been emitted).
> limit    : runlevel [2345]
> outcome  : match? I think yes.
> 
> 
> start on : foo and runlevel 2
> runlevel : 2 (and foo has been emitted).
> limit    : runlevel [2345]
> outcome  : match - job will not run.
> 

Right, all of these are handled gracefully if limit just masks out the
conditions passed to it.

> 
> == Storage of Limit Conditions ==
> 
> The two main ideas here are:
> 
>  * Create a single file to store all limit information.
> 
>    A good location might be "/etc/init.limit". This file would store
>    job restriction details in a simple format such as:
> 
>      <job> [<condition>]
> 
>    So, if job "cron" was disabled entirely, it would contain:
> 
>      cron
> 
>    Whereas if the job was disabled in runlevels 3-5 it would contain:
> 
>      cron runlevel [345]
> 
>    If the file exists on startup, Upstart would read the job
>    limit details.
> 
>    Pros:
> 
>    * Single file outside of /etc/init/ so might be "safer" in the case
>      where an admin ran "cd /etc/init; rm * .override" say by mistake.
> 
>    * It would be a "single point of definition" and thus easier to
>      backup and apply to other systems maybe?
> 
>    Cons:
> 
>    * File would nominally need to be rewritten each time a change was
>      made. Might not be too bad since changing limits is perceived as
>      being an irregular activity (but tell me if you have other views on
>      this! :)
> 
>    * Possible locking issues if multiple requests came in to change a
>      limit at the same time.
> 
>  * Create per job files
> 
>    In a similar fashion to the existing ".conf" and ".override" files,
>    we could introduce "/etc/init/<job>.limit". If this file existed
>    and was empty, the job would be fully disabled (never automatically
>    started). However, if it contains "<condition>", that would be applied.
> 
>    Pros:
>     * Analog to ".conf" and ".override" so familiar to users.
> 
>    Cons:
> 
>     * Easy to inadvertently delete a ".limit" file maybe?
> 
>     * We're starting to create a lot of files now. Theoretically there
>       could now be 3 files / job (".conf", ".override" and ".limit").
>       We're not likely to reach the inotify limit (4096 watches?) yet,
>       but it is something to be aware of, moreso in the server or maybe
>       development server environment.
> 
>   However the Limit Condition file(s) is/are created, care needs to be
>   taken to ensure that it is not possible to lose data should
>   the system fail / be rebooted in mid-write.

Option 3, just store them as overrides.

Pros:
 * Single point for admins to go to look for overridden settings for a job.
 * Implementation would simply be a script that is able to parse and understand
   Upstart's event conditions.
 * No features need to be added to Upstart itself. It may be useful to expose the
   job parsing as a library, but that is not *essential*.

Cons:
 * May conflict with other tools that manipulate override.
 * May confuse admins who are using override without expecting a system level tool
   to override their .. overrides.
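
On the concern about losing data mid-write, a tool that appends to the
override file can write a temporary file and rename it into place. This
is a sketch only: append_override is an invented name, and the
assumption that Upstart ignores a dot-prefixed temp file in /etc/init
(because it ends in neither .conf nor .override) is mine and should be
checked.

```python
import os
import tempfile


def append_override(path: str, stanza: str, comment: str) -> None:
    """Append a commented stanza to an override file via write-then-rename."""
    try:
        with open(path) as f:
            existing = f.read()
    except FileNotFoundError:
        existing = ""
    if existing and not existing.endswith("\n"):
        existing += "\n"
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".override-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(existing + "# %s\n%s\n" % (comment, stanza))
            f.flush()
            os.fsync(f.fileno())
        os.chmod(tmp, 0o644)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

This does not solve two tools racing to edit the same override; that
still needs a lock or a single tool that owns the file.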