start/stop hook guarantees

Tue Dec 6 18:14:16 UTC 2011

Excerpts from William Reade's message of Tue Dec 06 07:50:28 -0800 2011:
> On Sat, 2011-12-03 at 15:01 -0200, Gustavo Niemeyer wrote:
> > This all makes sense to me, William. Thanks for the write up and the heads up.
> 
> Sadly, it's started to make much less sense to me as I delve deeper into
> the state restoration work. As I understand it, the intent is to allow
> us to smoothly transition back into a "started" state, and that doesn't
> sound like a bad goal in itself. However, consider the states the unit
> workflow can be in:
> 
> * None
> 
> We don't want to explicitly run the start hook here; the service hasn't
> even been installed, and the normal process of starting the unit agent
> will lead us through "installed" to "started" regardless.
> 
> * installed
> 
> As above; normal startup will transition us to "started" anyway.
> 
> * install_error
> 
> The chances of "start" working correctly are minimal, and if it doesn't
> work, what should we do anyway? Switch to "start_error", and obscure the
> real cause of the failure?
> 
> * started
> 
> I guess it can't hurt, in the case of a charm that doesn't use upstart
> or otherwise monitor itself.
> 
> * start_error
> 
> May as well retry, I suppose (but I'm not sure what justification we
> have for believing the result to be any different, or why this case is
> special enough to overrule our preference for requiring explicit user
> action to resolve error states).
> 
> * configure_error
> 
> Whether it works or not, a transition to "started" or "start_error" is
> going to be profoundly misleading.
> 
> * charm_upgrade_error
> 
> Definitely a Bad Thing; we'll be breaking the guarantee that the
> upgrade-charm hook will be the *first* one called after the charm
> upgrade operation.
> 
> * stopped
> 
> Based on IRC discussion today, "stopped" should mean "the unit has gone
> away and is never coming back" [0], and so if by some freak occurrence
> we *do* restart a machine, and the unit agent comes up "stopped", we
> definitely don't want to start it again.
> 
> * stop_error
> 
> As above; we can't do anything meaningful from this state, and starting
> from this state is actively wrong.
> 
> 
> ...so. Assuming we still want to enable the weakly-written charms
> discussed previously, I think it makes much more sense to offer a *much*
> more limited guarantee: that, on the first run after reboot, the "start"
> hook will be called again if the unit is in a "started" state.
> 
> The "start" hook may of course be called as a result of the unit
> starting off in None or "installed", but that'd happen anyway, so it
> doesn't need explicit mention.
> 
> Does this make sense?

Yes, definitely. This sounds like the right, specific guarantee to make.
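
To make the narrowed guarantee concrete, here is a minimal sketch of the
decision it implies. This is an illustration only: the names
WORKFLOW_STATES and should_rerun_start_hook, and the
first_run_after_reboot flag, are hypothetical and are not taken from the
juju codebase. It covers only the explicit re-run on reboot; the "start"
hook that fires during normal startup from None or "installed" is outside
its scope.

```python
# Hypothetical sketch; none of these names come from the juju codebase.

# The workflow states listed in the thread. None means the unit has not
# been installed yet.
WORKFLOW_STATES = (
    None,
    "installed",
    "install_error",
    "started",
    "start_error",
    "configure_error",
    "charm_upgrade_error",
    "stopped",
    "stop_error",
)


def should_rerun_start_hook(state, first_run_after_reboot):
    """Return True only in the one case the limited guarantee covers.

    The "start" hook is explicitly re-run only when the unit agent comes
    up for the first time after a reboot and finds the unit "started".
    Every other state is left alone: None and "installed" reach "started"
    through normal startup, error states wait for explicit user action,
    and "stopped"/"stop_error" must never be restarted.
    """
    if state not in WORKFLOW_STATES:
        raise ValueError("unknown workflow state: %r" % (state,))
    return bool(first_run_after_reboot) and state == "started"


if __name__ == "__main__":
    # Show the decision for every state on the first run after a reboot.
    for state in WORKFLOW_STATES:
        print("%-20s %s" % (state, should_rerun_start_hook(state, True)))
```

Rejecting unknown states with an error, rather than silently returning
False, is a design choice of the sketch and not something the thread
specifies.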

Thanks for running through the possibilities. :)