Packaging, recovery, and random fault injection

Kapil Thangavelu kapil.thangavelu at canonical.com
Wed Apr 27 13:59:09 UTC 2011


Excerpts from Clint Byrum's message of Mon Apr 25 23:28:50 -0400 2011:
> Excerpts from Gustavo Niemeyer's message of Mon Apr 25 13:00:44 -0700 2011:
> 
> > >  - Increase ensemble's fault tolerance of agents that die. Machine
> > >   agents monitoring unit agents, and provisioning agents monitoring
> > >   machine agents.
> > 
> > Can't we use a dumb watchdog restarting the process in case it crashes,
> > rather than making the agents more complex?
> > 
> 
> upstart would suffice here
> 
> Just put this in /etc/init/ensemble-agent.conf:
> 
> stop on runlevel [!2345]
> respawn
> exec /path/to/your/agent
> 
> 
> Then start with 'start ensemble-agent'.
> 
> The default limit on respawns is 10 times in 5 seconds. After that
> manual intervention is required. It can be changed with:
> 
> respawn limit 10 5
> 
> I'm not sure what the 'start on' would be... runlevel [2345] would be
> a traditional service, but I think it may need to come before that.

Upstart definitely fits the bill. It's a big hammer if all we use is its respawn
capability, but it works well. I think we'd switch 'start on' to 'manual' and
have cloud-init start the provisioning and machine agents, while the machine
agent starts the unit agents.
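
As a rough sketch only, a manually started job could look something like the
following. The job name ensemble-machine-agent is hypothetical, and the exec
path is Clint's placeholder from above, not a real install location. Leaving
out 'start on' altogether means upstart never starts the job by itself, which
should have the same effect as marking it manual.

```conf
# /etc/init/ensemble-machine-agent.conf  (hypothetical job name)
description "Ensemble machine agent (sketch)"

# Deliberately no 'start on' stanza: the job only runs when something
# starts it explicitly, e.g. cloud-init or another agent.
stop on runlevel [!2345]

# Restart the agent if it dies; give up after 10 respawns in 5 seconds
# (these are the upstart defaults, written out here for clarity).
respawn
respawn limit 10 5

# Placeholder path carried over from the example above.
exec /path/to/your/agent
```

cloud-init would then run 'start ensemble-machine-agent' on first boot, and the
machine agent could do the equivalent for each unit agent it deploys.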

I filed the following bug for upstart integration.
https://bugs.launchpad.net/ensemble/+bug/770482

cheers,

Kapil



