Packaging, recovery, and random fault injection
kapil.thangavelu at canonical.com
Wed Apr 27 13:59:09 UTC 2011
Excerpts from Clint Byrum's message of Mon Apr 25 23:28:50 -0400 2011:
> Excerpts from Gustavo Niemeyer's message of Mon Apr 25 13:00:44 -0700 2011:
> > > - Increase ensemble's fault tolerance of agents that die. Machine
> > > agents monitoring unit agents, and provisioning agents monitoring
> > > machine agents.
> > Can't we have a dumb watchdog restarting the process in case it crashes,
> > rather than making the agents more complex?
> upstart would suffice here
> Just put this in /etc/init/ensemble-agent.conf:
> stop on runlevel [!2345]
> respawn
> exec /path/to/your/agent
> Then start with 'start ensemble-agent'.
> The default limit on respawns is 10 times in 5 seconds. After that
> manual intervention is required. It can be changed with:
> respawn limit 10 5
> I'm not sure what the 'start on' would be... runlevel  would be
> a traditional service, but I think it may need to come before that.
Upstart definitely fits the bill. It's a big hammer for just utilizing its respawn
capabilities, but it works well. I think we'd go ahead and switch 'start on' to
'manual' and have cloud-init start the provisioning and machine agents, and
have the machine agent start the unit agents.
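As a rough sketch of what such a job file might look like: the job name
ensemble-machine-agent is hypothetical, and the agent path is the same
placeholder used in Clint's example. Leaving out 'start on' entirely is one way
to get a job that only runs when something explicitly starts it.

```
# /etc/init/ensemble-machine-agent.conf  (hypothetical job name)
description "Ensemble machine agent"

# No 'start on' stanza: the job never starts on its own. Something else
# (cloud-init, or the machine agent for unit agents) has to start it.
stop on runlevel [!2345]

# Restart the agent if it dies; give up after 10 respawns within 5 seconds.
respawn
respawn limit 10 5

exec /path/to/your/agent
```

cloud-init would then run 'start ensemble-machine-agent' at first boot, and
upstart takes over the watchdog role from there.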
I filed the following bug for upstart integration.