Packaging, recovery, and random fault injection

Gustavo Niemeyer gustavo.niemeyer at
Mon Apr 25 20:00:44 UTC 2011

> Longer term, it would be nice to get some additional help from the
> server team on getting these core ensemble dependencies packaged
> nicely.

Indeed.  IIRC packaging an updated ZooKeeper was already pretty high
on the list of things, so let's talk to see if we can get that going
sooner rather than later.

> I'm proposing two separate tracks then.
>  - Rebuild the ensemble ec2 images, to include working versions of
>   zookeeper. In future getting current upstream versions of zookeeper
>   into oneiric.

+1, let's talk about that.

>  - Increase ensemble's fault tolerance of agents that die. Machine
>   agents monitoring unit agents, and provisioning agents monitoring
>   machine agents.

Can't we use a dumb watchdog that restarts the process in case it
crashes, rather than making the agents more complex?
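
As a rough illustration of the watchdog idea, something along these
lines would do. This is only a sketch, not Ensemble code: the
`watchdog` function, its `max_restarts` and `delay` parameters, and the
policy of treating any nonzero exit status as a crash are all
assumptions made up for the example.

```python
import subprocess
import sys
import time


def watchdog(argv, max_restarts=3, delay=1.0):
    """Run argv, restarting it whenever it exits with a nonzero status.

    Hypothetical helper: returns the number of restarts performed.
    Stops when the child exits cleanly (status 0) or once
    max_restarts restarts have been used up.
    """
    restarts = 0
    while True:
        # Blocks until the child process exits, then yields its status.
        status = subprocess.call(argv)
        if status == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        print("process exited with %d, restart %d of %d"
              % (status, restarts, max_restarts), file=sys.stderr)
        # Short pause so a process that crashes immediately does not
        # spin the CPU.
        time.sleep(delay)
```

The point is that the watchdog knows nothing about what the agent does;
it only looks at the exit status, so the agent itself stays unchanged.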

Either way, having the provisioning agent fiddle with the machine
agent process certainly sounds a bit awkward, since they may be on
separate machines.

> Additionally unit agents maintaining on disk
> queues of pending hook executions that they can recover from on
> startup.

Sounds good.
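
For reference, a minimal sketch of such an on-disk queue, assuming the
pending hooks can be represented as JSON-serializable values. The
`HookQueue` class, its method names, and the single-file JSON layout are
hypothetical and not taken from the Ensemble code base.

```python
import json
import os


class HookQueue:
    """Hypothetical on-disk queue of pending hook executions."""

    def __init__(self, path):
        self.path = path
        self.pending = []
        # On startup, recover whatever was still pending before a crash.
        if os.path.exists(path):
            with open(path) as f:
                self.pending = json.load(f)

    def _save(self):
        # Write to a temporary file and rename it over the old one, so
        # a crash mid-write never leaves a truncated queue file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.pending, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def put(self, hook):
        """Record a hook as pending before anything tries to run it."""
        self.pending.append(hook)
        self._save()

    def run_next(self, execute):
        """Run the oldest pending hook; return False if none is left."""
        if not self.pending:
            return False
        execute(self.pending[0])
        # The hook is dropped only after execute() returns, so a crash
        # during execution means it is run again after a restart.
        del self.pending[0]
        self._save()
        return True
```

With this ordering a hook may run more than once across a crash, but it
is never silently lost, so hooks would need to tolerate being re-run.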

Gustavo Niemeyer
