Packaging, recovery, and random fault injection

Tue Apr 26 03:28:50 UTC 2011

Excerpts from Gustavo Niemeyer's message of Mon Apr 25 13:00:44 -0700 2011:
> > Longer term, it would be nice to get some additional help from the
> > server team on getting these core ensemble dependencies packaged
> > nicely.
> 
> Indeed.  IIRC packaging an updated ZooKeeper was already pretty high
> on the list of things, so let's talk to see if we can get that going
> sooner rather than later.

I think the progression for all of the dependencies should probably be:

1. Package in ppa:ensemble/ppa
2. Upload to Ubuntu dev release
3. Upload to Ubuntu backports for latest LTS

> 
> > I'm proposing two separate tracks then.
> >
> >  - Rebuild the ensemble ec2 images, to include working versions of
> >   zookeeper. In future getting current upstream versions of zookeeper
> >   into oneiric.
> 
> +1, let's talk about that.
> 

Oneiric should open for dev around May 5:

https://wiki.ubuntu.com/OneiricReleaseSchedule

If we follow the progression above, we should already have all the latest
dependencies packaged in the PPA. Only things like copyright and metadata
will need to be checked by May 5.

We'll talk at UDS about this, but I'm 99% sure we'll want to file MIRs
(main inclusion requests) for zookeeper and txaws/txzookeeper in Oneiric.

> >  - Increase ensemble's fault tolerance of agents that die. Machine
> >   agents monitoring unit agents, and provisioning agents monitoring
> >   machine agents.
> 
> Can't we have a dumb watchdog restarting the process in case it
> crashes, rather than making the agents more complex?
> 

Upstart would suffice here.

Just put this in /etc/init/ensemble-agent.conf:

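# Stop the agent on halt, single-user mode, or reboot (runlevels 0, 1, 6).
stop on runlevel [!2345]
# Restart the agent automatically if it exits unexpectedly.
respawn
exec /path/to/your/agent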

Then start with 'start ensemble-agent'.

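The default limit on respawns is 10 times in 5 seconds. After that,
Upstart stops the job and manual intervention is required. The limit can
be changed with a 'respawn limit COUNT INTERVAL' stanza; the default,
written out explicitly, is: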

respawn limit 10 5
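
To make the 'respawn limit' semantics concrete, here is a rough Python
sketch of a watchdog that restarts a process until it has respawned COUNT
times within INTERVAL seconds. This is only an approximation of the
behaviour described above, not Upstart's implementation; the names
RespawnLimiter and supervise are made up for illustration.

```python
import collections
import subprocess
import time


class RespawnLimiter:
    """Allow at most `count` respawns within any `interval`-second window.

    Hypothetical helper; it only mimics the effect of 'respawn limit'.
    """

    def __init__(self, count=10, interval=5.0):
        self.count = count
        self.interval = interval
        self._stamps = collections.deque()  # times of recent respawns

    def allow(self, now):
        # Forget respawns that have fallen out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self.interval:
            self._stamps.popleft()
        if len(self._stamps) >= self.count:
            return False  # respawning too fast: give up
        self._stamps.append(now)
        return True


def supervise(argv, limiter, clock=time.monotonic):
    """Run argv, restarting it on exit, until the limiter refuses.

    Returns the number of times the process was run.
    """
    runs = 0
    while True:
        subprocess.call(argv)  # blocks until the child exits
        runs += 1
        if not limiter.allow(clock()):
            return runs
```

With the defaults, a process that dies immediately every time would be
run once, respawned 10 times, and then left stopped for someone to look
at by hand.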

I'm not sure what the 'start on' would be... runlevel [2345] would suit
a traditional service, but I think the agent may need to start before that.

> Either way, having the provisioning agent fiddling with the machine
> agent process certainly sounds a bit awkward, since they may be in
> separate machines.
> 
> > Additionally unit agents maintaining on disk
> > queues of pending hook executions that they can recover from on
> > startup.
> 
> Sounds good.
> 

+1 indeed, sounds like a nice robust design. :)
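
For what it's worth, the on-disk queue could be as simple as an
append-only journal that the unit agent replays on startup. A minimal
sketch, assuming hooks are identified by name and that running a hook
twice after a crash is acceptable (at-least-once); HookQueue and the file
layout are hypothetical, not anything that exists in ensemble today:

```python
import json
import os


class HookQueue:
    """Journal of pending hook executions that survives agent restarts."""

    def __init__(self, path):
        self.path = path
        self.pending = []  # hook names, oldest first
        if os.path.exists(path):
            self._replay()
            self._compact()

    def _replay(self):
        # Rebuild the pending list from the journal records.
        with open(self.path) as journal:
            for line in journal:
                try:
                    op, hook = json.loads(line)
                except (ValueError, TypeError):
                    continue  # torn write from a crash; skip it
                if op == "add":
                    self.pending.append(hook)
                elif op == "done" and hook in self.pending:
                    self.pending.remove(hook)

    def _compact(self):
        # Rewrite the journal with only pending hooks, atomically.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as journal:
            for hook in self.pending:
                journal.write(json.dumps(["add", hook]) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        os.replace(tmp, self.path)

    def _append(self, op, hook):
        with open(self.path, "a") as journal:
            journal.write(json.dumps([op, hook]) + "\n")
            journal.flush()
            os.fsync(journal.fileno())  # make sure it reaches the disk

    def add(self, hook):
        self._append("add", hook)
        self.pending.append(hook)

    def done(self, hook):
        self.pending.remove(hook)  # raises ValueError if not pending
        self._append("done", hook)
```

The agent would call add() when a hook is scheduled and done() once it
has finished; after a restart, whatever is left in 'pending' is what
still needs to run, in the original order.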