Intentionally introducing failures into Juju

Thu Aug 14 00:31:06 UTC 2014

I like the idea of being able to trigger failures using the juju command line.

I'm undecided about how the need to fail should be stored. An obvious
location would be a new collection managed by state, or even a field
on existing state objects and documents. The downside of this approach is
that a connection to state then needs to be available wherever we would
like failures to be triggered, which isn't always possible or
convenient.

Another approach would be to have "juju inject-failure" drop files in some
location (along the lines of what I've already implemented) using SSH. This
makes the failure checks easy to perform from anywhere, but makes existing
failures more difficult to manage. There would also be some added
complexity when creating failure files for about-to-be-created entities
(e.g. the "juju deploy --inject-failure" case).

Do you have any thoughts on this?

On 14 August 2014 02:25, Gustavo Niemeyer <gustavo.niemeyer at canonical.com>
wrote:

> That's a nice direction, Menno.
>
> The main thing that comes to mind is that it sounds quite inconvenient
> to turn the feature on. It may sound otherwise because it's so easy to
> drop files at arbitrary places in our local machines, but when dealing
> with a distributed system that knows how to spawn its own resources
> up, suddenly the "just write a file" becomes surprisingly boring and
> race prone.
>
> What about:
>
>     juju inject-failure [--unit=unit] [--service=service] <failure name>
>     juju deploy [--inject-failure=name] ...
>
>
>
> On Wed, Aug 13, 2014 at 7:17 AM, Menno Smits <menno.smits at canonical.com>
> wrote:
> > There's been some discussion recently about adding some feature to Juju
> > to allow developers or CI tests to intentionally trigger otherwise hard
> > to induce failures in specific parts of Juju. The idea is that sometimes
> > we need some kind of failure to happen in a CI test or when manually
> > testing but those failures can often be hard to make happen.
> >
> > For example, for changes to Juju's upgrade mechanics that I'm working on
> > at the moment I would like to ensure that an upgrade is cleanly aborted
> > if one of the state servers in an HA environment refuses to start the
> > upgrade. This logic is well unit tested but there's nothing like seeing
> > it actually work in a real environment to build confidence. However, it
> > isn't easy to make a state server misbehave in this way.
> >
> > To help with this kind of testing scenario, I've created a new top-level
> > package called "wrench" which lets us "drop a wrench in the works" so to
> > speak. It's very simple with one main API which can be called from
> > judiciously chosen points in Juju's execution to decide whether some
> > failure should be triggered.
> >
> > The module looks for files in $jujudatadir/wrench (typically
> > /var/lib/juju/wrench) on the local machine. If I wanted to trigger the
> > upgrade failure described above I could drop a file in that directory on
> > one of the state servers named say "machine-agent" with the content:
> >
> > refuse-upgrade
> >
> > Then in some part of jujud's upgrade code there could be a check like:
> >
> > if wrench.IsActive("machine-agent", "refuse-upgrade") {
> >      // trigger the failure
> > }
> >
> > The idea is this check would be left in the code to aid CI tests and
> > future manual tests.
> >
> > You can see the incomplete wrench package here:
> > https://github.com/juju/juju/pull/508
> >
> > There are a few issues to nut out.
> >
> > 1. It needs to be difficult/impossible for someone to accidentally or
> > maliciously activate this feature, especially in production
> > environments. I have almost finished (but not pushed to GitHub) some
> > changes to the wrench package which make it strict about the ownership
> > and permissions on the wrench files. This should make it harder for the
> > wrong person to drop files into the wrench directory.
> >
> > The idea has also been floated to only enable this functionality in
> > non-stable builds. This certainly gives a good level of protection but
> > I'm slightly wary of this approach because it makes it impossible for CI
> > to take advantage of the wrench feature when testing stable release
> > builds. I'm happy to be convinced that the benefit is worth the cost.
> >
> > Other ideas on how to better handle this are very welcome.
> >
> > 2. The wrench functionality needs to be disabled during unit test runs
> > because we don't want any wrench files a developer may have lying around
> > to affect Juju's behaviour during test runs. The wrench package has a
> > global on/off switch so I plan on switching it off in BaseSuite's setup
> > or similar.
> >
> > 3. The name is a bikeshedding magnet :)  Other names that have been
> > bandied about for this feature are "chaos" and "spanner". I don't care
> > too much so if there's a strong consensus for another name let's use
> > that. I chose "wrench" over "spanner" because I believe that's the more
> > common usage in the US and because Spanner is a DB from Google. Let's
> > not get carried away...
> >
> > All comments, ideas and concerns welcome.
> >
> > - Menno
>
> --
> gustavo @ http://niemeyer.net
>