Packaging, recovery, and random fault injection

Kapil Thangavelu kapil.thangavelu at canonical.com
Mon Apr 25 15:38:34 UTC 2011


Hi Folks,

While testing out natty server with ensemble and the new region
portability functionality, Jim and I noticed that things were failing
to deploy properly.  I tracked this down further to various unit
agents segfaulting mid-operation, which is unrelated to the region
portability work. Last summer, when ensemble work was kicking off with
the twisted zookeeper bindings, we had several interactions with
upstream to incorporate various bug fixes for the zookeeper python
bindings. Unfortunately these fixes haven't landed in the ubuntu
packages, and our unit agents have grown complex enough that they
trigger these bugs and segfault.

While pondering this, it also became apparent that these unit agent
segfaults represent a nice opportunity to increase the robustness of
ensemble by enabling recovery of dead agents. We have several tickets,
new and old, relating to this.

More immediately, though, we need to fix this issue, as it prevents
usage of ensemble. I had previously done some bespoke work on
packaging newer versions of zookeeper and creating some basic debian
packaging for txzookeeper and ensemble. Clint has cleaned up the
latter two and assembled an ensemble ppa.  I had tried to get this
building via launchpad package building, but ran into difficulties
from various errors and the long (multi-day at the time) roundtrips
for feedback on the process. I ended up constructing a simple
standalone script (ensemble/debian/ec2-build) which built the packages
on an ec2 instance as an interim solution. For the immediate future,
I'm suggesting rebuilding the ensemble ami with these binaries.
Longer term, it would be nice to get some additional help from the
server team on getting these core ensemble dependencies packaged
nicely.

There was some talk last cycle of using cloudera's hadoop distribution
or at least including it in the partners repo. Unfortunately the
cloudera distribution lacks the zookeeper-python binding, which is the
foundational layer for ensemble. As it stands, the version in natty is
the same as that in maverick.


I'm proposing two separate tracks then.

 - Rebuild the ensemble ec2 images to include working versions of
   zookeeper. In future, get current upstream versions of zookeeper
   into oneiric.

 - Increase ensemble's tolerance of agents that die: machine agents
   monitoring unit agents, and provisioning agents monitoring machine
   agents.  Additionally, unit agents would maintain on-disk queues of
   pending hook executions that they can recover from on startup.
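
To make the on-disk queue idea in the second track a bit more
concrete, here is a rough sketch of what a unit agent might keep on
disk. Everything in it is an assumption for illustration: the
HookQueue name, the JSON file format, and the put/peek/done methods
are hypothetical and not part of ensemble today.

```python
import json
import os


class HookQueue:
    """Hypothetical on-disk queue of pending hook executions."""

    def __init__(self, path):
        self.path = path
        self.pending = []
        if os.path.exists(path):
            # Recover whatever was still pending when the agent died.
            with open(path) as f:
                self.pending = json.load(f)

    def _save(self):
        # Write to a temp file and rename it into place, so a crash
        # mid-write never leaves a truncated queue file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.pending, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def put(self, hook_name, context):
        # Persist the hook before it is run.
        self.pending.append({"hook": hook_name, "context": context})
        self._save()

    def peek(self):
        # Next hook to run, or None if nothing is pending.
        return self.pending[0] if self.pending else None

    def done(self):
        # Drop the entry only once the hook has actually finished.
        self.pending.pop(0)
        self._save()
```

With something like this, an agent that segfaults partway through a
hook would find that hook still at the head of the queue on restart
and could rerun it, which means hooks would need to tolerate being
executed more than once.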


cheers,

Kapil



