State of Upstart

Wed Sep 12 02:52:57 BST 2007

A couple of months ago, at GUADEC in Birmingham, I took part in a BOF
along with members of Fedora/Red Hat and SuSE to discuss Upstart and how
it fits into the "big picture" alongside udev, HAL, D-BUS, etc.

An obvious focus of this was discussing whether those distributions
would choose to ship Upstart in place of sysvinit, and out of this came
some "strong suggestions" as to changes to be made to Upstart's design.

Rather than leap directly into argument, debate or implementation, I
felt that it was worth "standing back" for a few weeks to let the
proposals settle and evaluate them in a somewhat more rational light.

This also fit well in that I had recently done a fairly large coding
marathon and hit a wall, where the features I had been attempting to
implement turned out to be unworkable with the current design.

So now, in the cold light of a rest, it's useful to go over Upstart's
design and the proposed changes, and debate them rationally.  I welcome
any and all comments and suggestions on this, and I strongly encourage
anyone reading this on the mailing list to reply with their thoughts.
Hopefully we can reach a consensus about the way forward.

Service Management
==================

Management of services is at the very core of what Upstart is designed
to do, and is therefore the most important piece to get right.  Each job
that Upstart manages passes through the same state machine:

    http://www.netsplit.com/2007/states.png

and has the following properties:

  * A goal of "stop" or "start".  This is the only externally modifiable
    part of a job.

  * A state, which is one of those illustrated in the above diagram.  A
    job is in exactly one state at a time.

  * Most states are transient and have a condition that will inherently
    cause the job to move to the next state; either a process is running
    that is always expected to terminate, or an event has been emitted
    that is always expected to finish.

  * Those states that are not transient ("waiting" and "running")
    require intervention to move them to the next state.  They are
    termed "rest states".  This intervention occurs when the goal is
    changed while in this state.

  * The next state depends on the goal: follow the green line when
    the goal is "start" and the red line when the goal is "stop".

  * The exception is "running" which has three next states; if the
    goal is "start" then the job is being respawned, otherwise the
    next state depends on whether there is a main process running or
    not (manual stop vs. process death).
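To make the shape of this concrete, here is a minimal Python sketch of a
goal-driven lookup.  It is not Upstart's C implementation: only "waiting",
"spawned" and "running" are names taken from the text above, and the
remaining state names and every transition in the tables are assumptions
standing in for the diagram.

```python
from enum import Enum


class Goal(Enum):
    STOP = "stop"
    START = "start"


# The two rest states named in the text; every other state is transient.
REST_STATES = frozenset({"waiting", "running"})

# ASSUMPTION: a reduced, hypothetical state set.  The real transitions are
# the ones drawn in the diagram, not these tables.
GREEN = {  # followed while the goal is "start"
    "waiting": "starting",
    "starting": "spawned",
    "spawned": "running",
    "pre-stop": "running",
    "stopping": "starting",
}
RED = {  # followed while the goal is "stop"
    "waiting": "waiting",
    "starting": "stopping",
    "spawned": "stopping",
    "pre-stop": "stopping",
    "stopping": "waiting",
}


def next_state(state, goal, main_process_running=False):
    """Return the state that follows `state` under the given goal."""
    if state == "running":
        # The one exception: three possible successors.
        if goal is Goal.START:
            return "starting"   # respawn (assumed target)
        if main_process_running:
            return "pre-stop"   # manual stop (assumed target)
        return "stopping"       # process death (assumed target)
    table = GREEN if goal is Goal.START else RED
    return table[state]
```

The point of the sketch is that the goal alone selects the table, and
only "running" needs to look at anything else.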

This state machine has gone unchanged for sufficiently long, and more
than proven its correctness.  Its implementation is simple and elegant,
which is a good sign.

Because the only externally modifiable part is the goal, the behaviour
of "start" and "stop" are very well defined.  The first will change the
goal, and if the job is in a rest state, the state.  Subsequent issues
of the same command will not effect any change, until the opposite
command is given.  A "start" followed by a "stop" results in a
controlled clean-up of the service (whether running or not), a "stop"
followed by a "start" results in a clean restart (or start if not
already running).

  Note: issuing two commands in a row is not truly atomic, since
  commands can be received from other processes in the meantime.  There
  should be IPC commands to "clean-up" and "restart" the service,
  issuing the pair of commands in an atomic fashion, but still
  implemented using the "start" and "stop" primitives.
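As a rough illustration of the note above, the following Python sketch
shows repeated commands being ignored and a "restart" built from the two
primitives.  The class and method names are invented for the example, and
it assumes commands are handled one at a time by a single loop, so that
nothing can interleave between the two calls inside restart().

```python
class Job:
    """Hypothetical job holding only a goal, the externally modifiable part."""

    def __init__(self):
        self.goal = "stop"
        self.changes = []       # history of goal changes, for illustration

    def _set_goal(self, goal):
        if self.goal == goal:
            return False        # same command issued again: no change
        self.goal = goal
        self.changes.append(goal)
        return True

    def start(self):
        return self._set_goal("start")

    def stop(self):
        return self._set_goal("stop")

    def restart(self):
        # Both primitives run inside one command handler, so a command
        # from another client cannot arrive between them (assumption:
        # single-threaded command processing).
        self.stop()
        self.start()
```

A second "start" returns False and leaves the history untouched, while
restart() on a started job records a "stop" followed by a "start".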

The behaviour of attached processes is also starting to settle, and is
sufficiently flexible that state-less actions such as "reload" are
possible in future; albeit requiring the definition of some policy
concerning them.

One notable missing piece is the ability to supervise processes that
fork(), detaching from the process that Upstart is supervising.  The
state machine already treats "spawned" and "running" as independent
states, and the intended design is that such a process would not leave
the "spawned" state until it has forked.  The exact mechanism for
detecting this (pid file creation, /proc watching, SIGCHLD, etc.) is up
for debate, but implementable within the current design.

So now on to the flaws with the service management core, which are at
some level all related.

Atomicity, Instantiation and Identification
-------------------------------------------

The original design of Upstart expected that each service be unique,
carrying an identifying name.  Only one copy of any service could be
running at one time, and multiple attempts to "start" that service
would result in the subsequent attempts being ignored unless preceded by
a "stop" attempt.

This is fine for simple services such as udev or HAL, and implements a
long-needed service management job.  HAL would no longer need to worry
about grabbing a D-BUS name to protect its atomicity; it can rely on
Upstart performing that job and instead grab the name when it needs it
(and treat a failure to do so as a fatal error of some kind).

But it isn't fine for tasks, which require a copy to be started each
time the "start" command is run.

To accommodate those, we added the notion of "instance" jobs.  These are
handled somewhat strangely.  A copy of the job definition exists which
is always stopped; to change the goal to "start", you copy that job
definition in memory and then start that.  Pointers maintain the
relationship between the instances and their original job.

This works for simple instances, but doesn't work if you need to pass
some kind of information to that instance, or the instantiation needs to
be atomic at some other level.

Take the "getty" service for instance (heh).  The most elegant way to
implement this is as a single service that is instantiated for each TTY
that a getty should appear on.  This means that the job needs to have
some information about the intended TTY, passed when it is started; and
Upstart needs to check whether an instance with that TTY already exists
before allowing it to be started.

In order to be able to do this, we're going to need to identify each
instance, and define schemes of atomicity for instances of the job
(presumably the default would be one instance at a time).
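One possible shape for this, sketched in Python: the definition names an
environment variable that identifies each instance, and a start request
is ignored if an instance with that identity already exists.  The names
JobDefinition, instance_key and the "TTY" variable are hypothetical, not
part of Upstart.

```python
class JobDefinition:
    """Hypothetical definition that tracks its own instances."""

    def __init__(self, name, instance_key=None):
        self.name = name
        # Environment variable that identifies an instance; None means
        # the default of one instance at a time.
        self.instance_key = instance_key
        self.instances = {}     # identity -> environment of that instance

    def start(self, env=None):
        """Create an instance; return False if it already exists."""
        env = dict(env or {})
        # A missing key raises KeyError: the caller must say which
        # instance it means.
        identity = env[self.instance_key] if self.instance_key else None
        if identity in self.instances:
            return False
        self.instances[identity] = env
        return True
```

With this, a "getty" definition keyed on TTY accepts tty1 and tty2 but
ignores a second request for tty1, while an unkeyed job such as udev
behaves like the original unique service.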

Configuration
-------------

The other flaw in the "unique service" design is when the definition of
a service is changed on disk.  Upstart needs to reload the definition,
but also retain the existing definition in memory while jobs are running
with it.

This is made even more complex by instances, where you have multiple
running jobs sharing a service definition for which there is a different
one on disk.

At what level do you use the new definition?

We opted for "never, as long as a job is running with the old one";
since the definition may include the parameters of atomicity, or
locking, and two versions may not be compatible.  We certainly didn't
want the "post-stop" of the new definition used for a job that ran the
old "pre-start", because it might not clean up everything.

A mechanism for handling this exists through the "deleted" meta-state
(which exists simply so that you can still use a Job pointer after
calling job_change_state) and the job_should_replace() function.  But
this is inelegant, and therefore probably wrong.

An alternate solution is possible, whereby the relationship between job
name and current definition is kept separate from the Job structure
itself.

This becomes even more desirable once we support sourcing job
definitions from multiple places, where Upstart may have to resolve
having two or more definitions for a particular job name and select an
order of precedence to handle it.  It cannot discard the
lower-precedence definition either, since that becomes the active job
definition when the other is lost.

Solution?
---------

Both of these problems push us towards a design where the definition of
a job and its current state are kept entirely separate.

Firstly we would continue the current configuration implementation where
Upstart knows about sources, files and items; with the items being the
job definitions themselves.

Job definitions are then also registered against their name in a hash
table, which maintains a list of all available job definitions with that
name and a pointer to the current one.

Changing a configuration file would cause its definitions to be removed
from the available list, but the current pointer would only be moved away
from a removed definition if the job is replaceable (i.e. all instances
are stopped).

The state of jobs would be kept in new Instance structures, which are
linked from the definition.  The definition would define the level of
atomicity, and many jobs would only carry a single Instance.

An important change here is that there would no longer be a "waiting"
state, since an instance attempting to reach that would simply be
discarded.
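A minimal Python sketch of the proposed split, under stated assumptions:
the class and method names are invented, precedence is simply the order
in which definitions were added, and an "instance" is any object in the
definition's list.  It is meant to show the bookkeeping, not a design
decision.

```python
class Definition:
    """A parsed job definition; running state hangs off it as instances."""

    def __init__(self, name, source):
        self.name = name
        self.source = source    # where the definition was loaded from
        self.instances = []     # instances currently using this definition

    def replaceable(self):
        return not self.instances


class Registry:
    def __init__(self):
        # name -> {"available": [Definition, ...], "current": Definition}
        self.entries = {}

    def add(self, definition):
        entry = self.entries.setdefault(
            definition.name, {"available": [], "current": None})
        entry["available"].append(definition)
        self.refresh(definition.name)

    def remove(self, definition):
        self.entries[definition.name]["available"].remove(definition)
        self.refresh(definition.name)

    def refresh(self, name):
        """Re-pick the current definition, unless the old one is in use."""
        entry = self.entries[name]
        current = entry["current"]
        if current is not None and not current.replaceable():
            return              # instances are still running with it
        available = entry["available"]
        entry["current"] = available[0] if available else None

    def current(self, name):
        return self.entries[name]["current"]
```

If a definition is removed and re-added while an instance is running,
current() keeps returning the old definition; once the last instance is
gone, a refresh() switches to the new one.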

Service Activation
==================

One of the last things I was working on with Upstart was the activation
of services; before we discuss the changes, let's first go over the
design that exists in the currently released version of Upstart.

Services may specify a list of events that, if any occur, cause their
goal to be set to "start"; and a second list of events that, if any
occur, cause their goal to be set to "stop".

The obvious problem with this is that there is no way to specify that
two events must occur before it can be started.

On trunk, the implemented solution to that was to allow events to be
combined with "and"/"or" operators.  While this might seem to be the
solution, it turns out that this has a particular problem:

Take the simple example of a service that depends on Apache and MySQL
running; it might use something like this:

  start on started apache and started mysql
  stop on stopping apache or stopping mysql

This initially appears to work fine: when Apache is started the Job
won't start yet, but when MySQL is started the Job starts.  Then should
either Apache or MySQL stop, the job stops again.

The problem comes with what happens if either is *restarted* instead,
say MySQL.  If it happens quickly enough, it'll appear to work because
the event will just re-affirm the start condition and the Job will start
again.

If it takes longer than the Job takes to stop (and therefore be cleaned
up), the knowledge that Apache was still running is lost; and now the Job
will not be started because it's waiting for Apache to be started again.

Two problems are highlighted here, and while either solution appears to
fix it, it turns out that we need both solutions to cover both sets of
use-cases.

The first is that the combination of operators needs to be atomic
itself; once evaluated it should never be evaluated again, except from
scratch.  This solves the strange racey behaviour above.

But we've solved it the wrong way for this example; now you need to
always restart both services to ensure this Job is restarted.  This
leads us onto the other solution.
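The atomic evaluation described above, and the drawback it brings for
this example, can be sketched in Python as follows.  The Match/And/Or
classes and the feed() helper are invented for illustration and events
are plain strings; this is not how trunk implements operators.

```python
class Match:
    """Leaf of the expression: becomes true once its event is seen."""

    def __init__(self, event):
        self.event = event
        self.value = False

    def handle(self, event):
        if event == self.event:
            self.value = True

    def reset(self):
        self.value = False


class _Operator:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def handle(self, event):
        self.left.handle(event)
        self.right.handle(event)

    def reset(self):
        self.left.reset()
        self.right.reset()


class And(_Operator):
    @property
    def value(self):
        return self.left.value and self.right.value


class Or(_Operator):
    @property
    def value(self):
        return self.left.value or self.right.value


def feed(expr, event):
    """Return True when `event` completes the expression, then forget
    everything so the next evaluation starts from scratch."""
    expr.handle(event)
    if expr.value:
        expr.reset()
        return True
    return False
```

With start_on = And(Match("started apache"), Match("started mysql")),
feeding both events fires once; a later "started mysql" on its own does
not fire again, which is exactly the "restart both services" drawback.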

Sets of related atomic start/stop event groups need to be able to be
tied together; we've called these States.  The definition of a job can
also include groupings of States that must be true; and likewise the
definition of a State can do the same.

In effect, you would define a job or state as:

  from EVENT1 and EVENT2
  until UNEVENT1 or UNEVENT2
  while STATE1 and STATE2

(Noting that if either UNEVENT1 or UNEVENT2 occur, both EVENT1 and
EVENT2 must occur again)
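As a hedged sketch of how a from/until/while definition might evaluate,
here is a small Python model.  It simplifies "from" to a set of events
that must all occur and "until" to a set of which any one suffices; the
class name, attribute names and the "net-up"/"net-down" events are made
up for the example.

```python
class State:
    """Hypothetical state: true from a set of events until any un-event,
    and only while every state it depends on is also true."""

    def __init__(self, from_all, until_any, while_states=()):
        self.from_all = frozenset(from_all)
        self.until_any = frozenset(until_any)
        self.while_states = tuple(while_states)
        self.seen = set()       # "from" events seen since the last reset
        self.latched = False

    def handle(self, event):
        if event in self.until_any:
            # Any un-event resets everything: all "from" events must
            # occur again before the state can become true.
            self.seen.clear()
            self.latched = False
        elif event in self.from_all and not self.latched:
            self.seen.add(event)
            self.latched = self.seen == self.from_all

    @property
    def active(self):
        return self.latched and all(s.active for s in self.while_states)
```

A state built this way stays false until both of its "from" events have
arrived and the states in its "while" clause are true, and a single
un-event sends it back to the beginning.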

And now we get into the famous "is a state a job without an attached
process, or is a job a state with an attached process" argument.  My
opinion on this changes as often as my socks.

The argument for states just being process-less-jobs is that they share
the same naming conventions, problems of definition, instantiation,
activation requirements, etc.

The argument for jobs being states-with-processes is that states don't
need the state machine and attached process handling.

I mentioned instantiation, because it turns out that you'd need states
to also have different levels of atomicity.  The "tty exists" state
needs to be instantiated for each unique TTY.  It turns out then that
you need other special state operators:

  * is STATE true for any value?
  * is STATE true for this value?
  * is STATE true for a subset of the following values?
  * is STATE true for all of the following values?
  * is STATE true for any of the following values?
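These five questions map fairly naturally onto set operations.  The
Python sketch below assumes a state simply records the set of values it
is currently true for; the class and method names are invented, and the
reading of "a subset of the following values" (true for at least one
value, and for none outside the list) is one possible interpretation
rather than a settled definition.

```python
class InstancedState:
    """Hypothetical per-value state, e.g. "tty exists" per TTY name."""

    def __init__(self):
        self.values = set()     # values for which the state is true

    def for_any_value(self):
        return bool(self.values)

    def for_value(self, value):
        return value in self.values

    def for_subset_of(self, candidates):
        # ASSUMED reading: true somewhere, and nowhere outside the list.
        return bool(self.values) and self.values <= set(candidates)

    def for_all_of(self, candidates):
        return set(candidates) <= self.values

    def for_any_of(self, candidates):
        return not self.values.isdisjoint(candidates)
```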

At this point, we have quite a complex definition of when a job should
actually be running; and this gets more fun when you try and decide
which bit is owned by the instance, and which bit is owned by the
definition.

Events
======

Events are, right now, deliberately simple.  They have a name, some
arguments and some environment variables.  The environment variables are
passed through to the job, since they might contain useful information.

The arguments were previously passed through as well, however with
combinations of events, the argument ordering was unclear.  Continuing
along this simple road, we're likely to drop arguments to events;
although then we make the syntax much harder unless we introduce
ordering to the environment variables and make their name optional in
that order.

Assuming that environment variables form the basis of the job
environment, and are used to define a job's atomicity, an event becomes
simply a name attached to a collection of them.  You also end up needing
to specify the same environment variables for the "start" command as
well.

Now onto the interesting part of this; integration with things like
udev, HAL and D-BUS.

It was our original thought that the "simple" event model would
eventually be replaced, yet somewhere along the line we decided that
instead the simple model meant that all other forms of message could be
converted down to it.

We'd use some common naming scheme for the events, and use environment
variables to convey their additional information.  The advantage of this
is obviously that Upstart can pretty much present them verbatim to its
services; the disadvantage is that the emitting application needs to
build an event from its information.

Upstart gets a bit of a headache from this, since we suddenly need to
reconstitute states from events emitted from applications.  Take HAL,
for instance: it knows that there's a TTY because it has a TTY object.  But
it would communicate this to Upstart by emitting two events, and Upstart
would have to reconstitute those to make a state again.

It's a simple fact that the current trend (at least in the GNOME/fd.o
world) is very much pointing towards D-BUS and HAL.  It's also a fact
that they are rather longer established than Upstart is.

Right now, for D-BUS messages and HAL objects to be used as Upstart
events, those applications (or some intermediary proxy) need to convert
the message or object events to Upstart events; fitting in with
Upstart's own style.

So we have to ship a file that converts a signal from
gnome-power-manager to an Upstart event saying "battery-critical", etc.
And likewise for every HAL object event, or those from udev, etc.

This gets annoying; now we end up with udev rules, HAL FDI files and
Upstart converters for every single component of the desktop.  This is
not helping us towards Utopia.

What's the alternative?

We could accept that HAL and D-BUS are part of the infrastructure, and
depend on them.

Instead of events and states we can use:

  * D-BUS signals

  * D-BUS methods (shutdown, etc.)

  * D-BUS Names existing

  * HAL Device objects existing

So you could go as far as to define a job as running when any HAL Device
exists that has the "camera" capability.

Of course, the obvious problem here is figuring out exactly how this
would work, and how do we pass the information about the HAL Device to
the processes being run by the job?

This matters, because it defines whether Upstart is limited to simply
managing the traditional services or whether it can manage the
super-modern services that we're going to see more and more of.

The concern is that by limiting it to traditional services, we're
limiting its lifetime because there won't be many of those left before
long.

Scott
-- 
Have you ever, ever felt like this?
Had strange things happen?  Are you going round the twist?