Packaging, recovery, and random fault injection
gustavo.niemeyer at canonical.com
Mon Apr 25 20:00:44 UTC 2011
> Longer term, it would be nice to get some additional help from the
> server team on getting these core ensemble dependencies packaged
Indeed. IIRC packaging an updated ZooKeeper was already pretty high
on the list of things, so let's talk to see if we can get that going
sooner rather than later.
> I'm proposing two separate tracks then.
> - Rebuild the ensemble ec2 images, to include working versions of
> zookeeper. In future getting current upstream versions of zookeeper
> into oneiric.
+1, let's talk about that.
> - Increase ensemble's fault tolerance of agents that die. Machine
> agents monitoring unit agents, and provisioning agents monitoring
> machine agents.
Can't we have a dumb watchdog restarting the process in case it crashes,
rather than making the agents more complex?
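To make the idea concrete, here is a rough sketch of what I mean by a
dumb watchdog. It is only an illustration: the function name, the
max_restarts and delay parameters, and the stand-in crashing command are
all made up for the example and are not part of Ensemble.

```python
import subprocess
import sys
import time


def watchdog(cmd, max_restarts=3, delay=0.0):
    """Run cmd, restarting it each time it exits with a non-zero status.

    Hypothetical helper, not an Ensemble API. Returns the exit codes
    observed, in order, so callers can see how many runs happened.
    """
    exit_codes = []
    restarts = 0
    while True:
        code = subprocess.call(cmd)  # blocks until the child exits
        exit_codes.append(code)
        if code == 0:
            break  # clean exit, nothing to restart
        if restarts >= max_restarts:
            break  # give up instead of looping forever
        restarts += 1
        time.sleep(delay)  # optional pause before restarting
    return exit_codes


if __name__ == "__main__":
    # Stand-in for an agent process that always crashes (assumption).
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(watchdog(crashing, max_restarts=2))
```

The point is that the restart logic lives entirely outside the agent, so
the agent itself needs no knowledge of how it is being supervised.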
Either way, having the provisioning agent fiddling with the machine
agent process certainly sounds a bit awkward, since they may be in
> Additionally unit agents maintaining on disk
> queues of pending hook executions that they can recover from on