Intentionally introducing failures into Juju

Wed Aug 13 10:17:04 UTC 2014

There's been some discussion recently about adding a feature to Juju that
lets developers or CI tests intentionally trigger otherwise hard-to-induce
failures in specific parts of Juju. Sometimes a CI test or a manual test
needs a particular failure to occur, and those failures are often hard to
provoke on demand.

For example, for the changes to Juju's upgrade mechanics that I'm working
on at the moment, I would like to ensure that an upgrade is cleanly aborted
if one of the state servers in an HA environment refuses to start the
upgrade. This logic is well unit tested, but there's nothing like seeing it
actually work in a real environment to build confidence. However, it isn't
easy to make a state server misbehave in this way.

To help with this kind of testing scenario, I've created a new top-level
package called "wrench" which lets us "drop a wrench in the works" so to
speak. It's very simple with one main API which can be called from
judiciously chosen points in Juju's execution to decide whether some
failure should be triggered.

The package looks for files in $jujudatadir/wrench (typically
/var/lib/juju/wrench) on the local machine. If I wanted to trigger the
upgrade failure described above, I could drop a file in that directory on
one of the state servers named, say, "machine-agent" with the content:

refuse-upgrade

Then in some part of jujud's upgrade code there could be a check like:

if wrench.IsActive("machine-agent", "refuse-upgrade") {
     // trigger the failure
}

The idea is this check would be left in the code to aid CI tests and future
manual tests.

You can see the incomplete wrench package here:
https://github.com/juju/juju/pull/508

There are a few issues to nut out.

1. It needs to be difficult/impossible for someone to accidentally or
maliciously activate this feature, especially in production environments. I
have almost finished (but not pushed to GitHub) some changes to the wrench
package which make it strict about the ownership and permissions on the
wrench files. This should make it harder for the wrong person to drop files
into the wrench directory.

The idea has also been floated to only enable this functionality in
non-stable builds. This certainly gives a good level of protection but I'm
slightly wary of this approach because it makes it impossible for CI to
take advantage of the wrench feature when testing stable release builds.
I'm happy to be convinced that the benefit is worth the cost.

Other ideas on how to better handle this are very welcome.

2. The wrench functionality needs to be disabled during unit test runs
because we don't want any wrench files a developer may have lying around to
affect Juju's behaviour during test runs. The wrench package has a global
on/off switch so I plan on switching it off in BaseSuite's setup or similar.
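
For illustration, a global switch of this kind could look something like
the sketch below. The SetEnabled and isActive names and the lookup callback
are hypothetical stand-ins, not the package's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu      sync.Mutex
	enabled = true // wrench checks are on unless explicitly disabled
)

// SetEnabled turns all wrench checks on or off and returns the previous
// setting, so a test suite can restore it during teardown.
func SetEnabled(next bool) bool {
	mu.Lock()
	defer mu.Unlock()
	prev := enabled
	enabled = next
	return prev
}

// isActive stands in for the real check. lookup is a hypothetical callback
// representing the wrench file lookup; it is skipped while disabled.
func isActive(lookup func() bool) bool {
	mu.Lock()
	on := enabled
	mu.Unlock()
	return on && lookup()
}

func main() {
	always := func() bool { return true }
	fmt.Println(isActive(always)) // true: switch is on by default

	prev := SetEnabled(false) // e.g. in a base test suite's setup
	fmt.Println(isActive(always)) // false: checks are globally disabled

	SetEnabled(prev) // e.g. in teardown
}
```

Returning the previous value from the setter makes it easy for a suite to
put things back the way it found them.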

3. The name is a bikeshedding magnet :)  Other names that have been bandied
about for this feature are "chaos" and "spanner". I don't care too much so
if there's a strong consensus for another name let's use that. I chose
"wrench" over "spanner" because I believe that's the more common usage in
the US and because Spanner is a DB from Google. Let's not get carried
away...

All comments, ideas and concerns welcome.

- Menno