<div dir="ltr">There's been some discussion recently about adding a feature to Juju that lets developers or CI tests intentionally trigger otherwise hard-to-induce failures in specific parts of Juju. Sometimes a CI test or a manual test needs a particular failure to occur, but that failure can be hard to provoke on demand.<div>
<br></div><div>For example, for the changes to Juju's upgrade mechanics that I'm working on at the moment, I would like to ensure that an upgrade is cleanly aborted if one of the state servers in an HA environment refuses to start the upgrade. This logic is well unit tested, but there's nothing like seeing it actually work in a real environment to build confidence. However, it isn't easy to make a state server misbehave in this way.</div>
<div><br></div><div>To help with this kind of testing scenario, I've created a new top-level package called "wrench" which lets us "drop a wrench in the works", so to speak. It's very simple, with one main API which can be called from judiciously chosen points in Juju's execution to decide whether some failure should be triggered.</div>
<div><br></div><div>The package looks for files in $jujudatadir/wrench (typically /var/lib/juju/wrench) on the local machine. If I wanted to trigger the upgrade failure described above, I could drop a file named, say, "machine-agent" in that directory on one of the state servers, with the content:</div>
<div><br></div><div><div><div><font face="courier new, monospace">refuse-upgrade</font></div><div><br></div></div><div>Then in some part of jujud's upgrade code there could be a check like:</div><div><br></div><div><font face="courier new, monospace">if wrench.IsActive("machine-agent", "refuse-upgrade") {</font></div>
<div><font face="courier new, monospace"> // trigger the failure</font></div><div><font face="courier new, monospace">}</font></div><div><br></div><div>The idea is this check would be left in the code to aid CI tests and future manual tests.</div>
<div><br></div><div>You can see the incomplete wrench package here: <a href="https://github.com/juju/juju/pull/508">https://github.com/juju/juju/pull/508</a></div></div><div><br></div><div>There are a few issues to nut out. </div>
<div><br></div><div>1. It needs to be difficult/impossible for someone to accidentally or maliciously activate this feature, especially in production environments. I have almost finished (but not pushed to GitHub) some changes to the wrench package which make it strict about the ownership and permissions on the wrench files. This should make it harder for the wrong person to drop files into the wrench directory.</div>
<div><br></div><div>The idea of enabling this functionality only in non-stable builds has also been floated. This certainly gives a good level of protection, but I'm slightly wary of this approach because it makes it impossible for CI to take advantage of the wrench feature when testing stable release builds. I'm happy to be convinced that the benefit is worth the cost.</div>
<div><br></div><div>Other ideas on how to better handle this are very welcome.</div><div><br></div><div>2. The wrench functionality needs to be disabled during unit test runs because we don't want any wrench files a developer may have lying around to affect Juju's behaviour during test runs. The wrench package has a global on/off switch so I plan on switching it off in BaseSuite's setup or similar.</div>
<div><br></div><div>3. The name is a bikeshedding magnet :) Other names that have been bandied about for this feature are "chaos" and "spanner". I don't care too much so if there's a strong consensus for another name let's use that. I chose "wrench" over "spanner" because I believe that's the more common usage in the US and because Spanner is a DB from Google. Let's not get carried away...</div>
<div><br></div><div>All comments, ideas and concerns welcome.</div><div><br></div><div>- Menno</div><div><br></div><div><br></div></div>