Automatic retries of hooks
William Reade
william.reade at canonical.com
Wed Jan 20 17:41:25 UTC 2016
On Wed, Jan 20, 2016 at 3:22 PM, Rick Harding <rick.harding at canonical.com>
wrote:
> +1: retries are great, with backoff, when you know you're doing it because
> you have experience that certain API requests to clouds, or to other known
> failure points, tend to fail transiently.
>
If you're thinking about it in terms of "known failure points", you already
understand that you need a wide net to catch all the retryable errors that
could come out of a given operation. What makes hook execution different
from any other code that we want to be reliable?
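To make the comparison concrete: the sort of targeted retry a charm author
might write looks roughly like the sketch below. This is illustrative only,
not anyone's actual charm code; create_volume is a hypothetical stand-in for
whichever cloud call the author knows to be flaky.

    import time

    # The exception types the charm author *knows* to be transient for this
    # particular call; anything else should still surface as a hook error.
    RETRYABLE = (TimeoutError, ConnectionError)

    def with_backoff(call, attempts=5, delay=1.0):
        """Retry `call` on known-transient errors, doubling the delay each time."""
        for attempt in range(1, attempts + 1):
            try:
                return call()
            except RETRYABLE:
                if attempt == attempts:
                    raise  # out of attempts: let the failure surface
                time.sleep(delay)
                delay *= 2

    # e.g. volume = with_backoff(lambda: create_volume(size_gb=10))

The catch is exactly the "wide net" problem: someone has to decide which
errors belong in RETRYABLE for every operation a hook performs.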
> Blindly just saying "if at first you don't succeed, go go go" isn't a
> better UX. It adds another layer of complexity in debugging, and doesn't
> really improve the product. Only the charm author knows enough about what
> it's trying to achieve to do intelligent retry.
>
Empirically, it seems that the retries caused jamespage's charm to succeed
where it would otherwise have failed; and we have happy results from Gabriel's
Windows charms as well. That seems to me to be evidence that the product is
improved...
> In this case, if there's something about unexpected reboots of machines,
> perhaps there's some specific case where Juju can grow some intelligence and
> hint to the charm author about what happened. The charm can then react to that
> information as it deems necessary.
>
It's not really about reboots. It's that we can't reliably distinguish
between all the cases that could cause us to record the start of a hook
execution but not its completion -- hook errors, context-flush-failure,
oom-killed-jujud, reboots, plain ol' bugs -- and that most of those don't
deserve a freak-out stop-the-world no-more-hooks reaction [0]. And even
when they *do* represent real problems with the deployment, the right thing to do is to
set status and move on *without* hook error, because a hook error prevents
the unit from reacting to changes and fixing itself when it can.
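In a Python hook, that pattern might look something like the sketch below
(illustrative only: database_ready and reconfigure_service are hypothetical
placeholders, and it assumes the unit has the status-set hook tool
available).

    import subprocess

    def set_status(state, message):
        """Report workload status via the status-set hook tool."""
        subprocess.check_call(['status-set', state, message])

    def config_changed():
        if not database_ready():          # hypothetical readiness check
            # Record the problem for the operator and return cleanly: no
            # hook error, so the unit keeps reacting to future changes.
            set_status('blocked', 'database relation not yet usable')
            return
        reconfigure_service()             # hypothetical reconfiguration
        set_status('active', 'unit is ready')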
Helpful?
Cheers
William
[0] and of course that is not a comprehensive list; there will always be more
ways we might fail -- adding heuristics and special handling for the
various cases will never be perfect, and will just make us less predictable
and less reliable.
> On Wed, Jan 20, 2016 at 8:42 AM Dean Henrichsmeyer <dean at canonical.com>
> wrote:
>
>> Hi,
>>
>> It seems the original point James was making is getting missed. No one is
>> arguing over the value of retryable and/or idempotent hooks.
>> Yes, you should be able to retry them and yes nothing should break if you
>> run them over and over.
>>
>> The point made is that Juju shouldn't be automatically retrying them. The
>> argument of "no one knows what went wrong so Juju automatically retrying
>> them is a better experience" doesn't work. The intelligence of the stack in
>> question, regardless of what it is, goes in the charms. If you start
>> conflating and mixing up where the intelligence goes, then creating,
>> running, and debugging those distributed systems will be a nightmare.
>>
>> The magic should only be in Juju's ability to effectively drive the
>> models and intelligence encoded in the charms. It shouldn't make
>> assumptions about what that intelligence is or what those models require.
>>
>> Thanks.
>>
>>
>> -Dean