agent upgrading

Fri Jun 8 17:07:56 UTC 2012

We'd like to be able to upgrade a running juju with new software.
This is a scheme I originally came up with for upgrading
minor (database-compatible) versions, somewhat modified after
discussion with Gustavo:

Client:
	- Push new version of tools.
	- Set new global version number in state.

Machine agent:
	- Wait for global version to change.
	- Download new version.
	- Copy version to where the local agents can see it.
	- Set new version in local agents' state.
	- Point "current" symlink to new version.
	- Exit and let upstart start new version.

Other agent (provisioning or unit agent):
	- Wait for agent's version to change.
	- Point "current" symlink to new version.
	- Exit and let upstart start new version.

I think we should be able to do better than this. The problem is with
the "exit and let upstart start new version" step - that means that if
we happen to upload a broken version, then everything instantly breaks
and needs manually restoring.

Here are some desirable features for an upgrade facility:

	1. Uploading a broken tool shouldn't break anything.
	2. ... even for a short while.
	3. We shouldn't rely too heavily on upstart, given the possibility
	of ports to systems without upstart.

Obviously point 1 is not entirely attainable - a tool can be broken in
any number of subtle ways that are not quickly detectable.

However, if each tool does a set of checks at startup time (checking
the version in the zk database and other dependencies) then I think that
the likelihood of breakage can be drastically reduced.
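
As an illustration of one such startup check, here is a minimal sketch
that refuses to start a tool whose major version differs from the
version recorded in the state, on the basis that only minor versions
are database-compatible. The "major.minor.patch" version format and
the function names are assumptions, not part of the proposal.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf returns the major component of a version such as "1.2.3".
// The dotted version format is an assumption for this sketch.
func majorOf(version string) (int, error) {
	major, err := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	if err != nil {
		return 0, fmt.Errorf("invalid version %q", version)
	}
	return major, nil
}

// checkStartup is a hypothetical startup check. It reports an error
// if the tool's major version differs from the version in the state.
func checkStartup(toolVersion, stateVersion string) error {
	tool, err := majorOf(toolVersion)
	if err != nil {
		return err
	}
	state, err := majorOf(stateVersion)
	if err != nil {
		return err
	}
	if tool != state {
		return fmt.Errorf("tool version %s cannot use database at version %s",
			toolVersion, stateVersion)
	}
	return nil
}

func main() {
	fmt.Println(checkStartup("1.3.0", "1.2.5"))
	fmt.Println(checkStartup("2.0.0", "1.2.5"))
}
```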

Here is a proposal for a scheme that addresses the above three points.
It remains the same as the original scheme, except that instead of
letting upstart start the agents directly, we interpose an intermediary,
say "upgrader". This tool would be designed to be small, well verified
and with minimal dependencies, so that it needs upgrading very seldom.

The final "exit and let upstart start new version" step is replaced with
the following upgrade path.

	- The agent asks the upgrader to run a new version of the agent
	by passing it the name of an executable and arguments.

	- The upgrader starts the new agent.

	- The new agent connects to the state, does whatever verification
	is necessary, and notifies the upgrader that it has successfully
	started (but doesn't actually *do* anything yet).

	- The upgrader notifies the original agent that the upgrade has
	been successful.

	- The original agent shuts down, notifies the upgrader that it
	has done so, and exits.

	- The upgrader notifies the new agent that it can continue,
	picking up the work where the old agent stopped.
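
The ordering of the steps above can be sketched as follows. This is
only a model of the handover, not the prototype: the agent interface,
its method names and the fakeAgent type are all hypothetical, and real
agents would be separate processes rather than values in one program.

```go
package main

import (
	"errors"
	"fmt"
)

// agent is a hypothetical view of an agent as seen by the upgrader.
type agent interface {
	// Verify starts the agent and waits for it to report that it
	// has connected to the state and passed its checks. The agent
	// does no work yet.
	Verify() error
	// Shutdown tells the agent to stop and waits until it has.
	Shutdown()
	// Continue tells a verified agent to start working.
	Continue()
}

// upgrade hands over from old to next. If next fails verification,
// old is left running untouched and the error is returned.
func upgrade(old, next agent) error {
	if err := next.Verify(); err != nil {
		return fmt.Errorf("upgrade abandoned: %v", err)
	}
	old.Shutdown()
	next.Continue()
	return nil
}

// fakeAgent records what happens to it, for demonstration only.
type fakeAgent struct {
	name   string
	broken bool
	events *[]string
}

func (a *fakeAgent) Verify() error {
	if a.broken {
		return errors.New(a.name + " failed verification")
	}
	*a.events = append(*a.events, a.name+" verified")
	return nil
}

func (a *fakeAgent) Shutdown() { *a.events = append(*a.events, a.name+" shut down") }

func (a *fakeAgent) Continue() { *a.events = append(*a.events, a.name+" continuing") }

func main() {
	var events []string
	old := &fakeAgent{name: "old", events: &events}
	next := &fakeAgent{name: "new", events: &events}
	fmt.Println(upgrade(old, next), events)
}
```

The point to note is that the old agent is not asked to shut down
until the new one has verified itself, so a broken upload leaves the
old agent running.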

A particular advantage of this scheme is that upgrading an agent
causes no down time, even if the new agent hangs for a long time
when starting: the old agent keeps working until the new one has
reported a successful start.

If the agent exits without upgrading, then the upgrader tool
will also exit, leaving upstart (or similar mechanism) to restart it;
this provides a way to upgrade the upgrader itself.

One possible drawback is that running the new and the old agent
side-by-side might be a problem in resource-constrained
environments. I don't think this would be a problem in practice
(most resources will probably be taken as the agent continues to
run, rather than at initialisation time), and we can work around it
if necessary by having a way to tell the upgrader to run the
programs sequentially, and fall back to the previous version if the
upgraded version fails.

I have a prototype of this proposal here (WIP, untested as yet):

	https://codereview.appspot.com/6307061

The upgrader in this implementation uses stdin and stdout to talk to
its child processes; another mechanism could be substituted if desired.

Major Version Upgrades (sketch)

Major version (database) upgrades can fit into the above scheme by
providing an additional synchronisation step. The client could give the
global version a "pending" tag and wait for all agents to indicate that
they are halted. Then it would upgrade the database, untag the version
and let the upgrade proceed as usual.
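
The client side of that synchronisation step could be sketched as
below. The state interface and its three methods are hypothetical
stand-ins for whatever the real state API provides; fakeState exists
only to show the order of operations.

```go
package main

import "fmt"

// state is a hypothetical client-side view of the juju state.
type state interface {
	// SetGlobalVersion sets the global version, optionally tagged
	// as pending.
	SetGlobalVersion(version string, pending bool) error
	// WaitAgentsHalted blocks until every agent reports it has halted.
	WaitAgentsHalted() error
	// UpgradeDatabase migrates the database to the given version.
	UpgradeDatabase(version string) error
}

// majorUpgrade tags the version as pending, waits for the agents to
// halt, upgrades the database, then removes the tag so that the
// usual upgrade can proceed.
func majorUpgrade(st state, version string) error {
	if err := st.SetGlobalVersion(version, true); err != nil {
		return err
	}
	if err := st.WaitAgentsHalted(); err != nil {
		return err
	}
	if err := st.UpgradeDatabase(version); err != nil {
		return err
	}
	return st.SetGlobalVersion(version, false)
}

// fakeState records each step, for demonstration only.
type fakeState struct{ steps []string }

func (s *fakeState) SetGlobalVersion(version string, pending bool) error {
	s.steps = append(s.steps, fmt.Sprintf("version %s pending=%v", version, pending))
	return nil
}

func (s *fakeState) WaitAgentsHalted() error {
	s.steps = append(s.steps, "agents halted")
	return nil
}

func (s *fakeState) UpgradeDatabase(version string) error {
	s.steps = append(s.steps, "database upgraded to "+version)
	return nil
}

func main() {
	st := &fakeState{}
	if err := majorUpgrade(st, "2.0.0"); err != nil {
		fmt.Println(err)
		return
	}
	for _, step := range st.steps {
		fmt.Println(step)
	}
}
```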