Automatic retries of hooks
William Reade
william.reade at canonical.com
Wed Jan 20 10:33:58 UTC 2016
On Tue, Jan 19, 2016 at 3:14 PM, James Page <james.page at ubuntu.com> wrote:
>
> I think this is a dangerous behaviour to introduce to Juju; a hook error
> should be a signal to an end user that something really bad happened, and
> that they need to dig in further (preferably with pointers from status
> messages); if the function that a hook is performing is re-tryable, that
> needs to be handled in charm and not by Juju IMHO.
>
There are a few problems with this.
0) The function that a hook is performing *must* be retryable anyway. Hooks
need to be idempotent; we guarantee at-least-once execution, not
at-most-once.
1) As a user, what a hook error means in practice is "retry the hook" (good
thing all those hooks are idempotent...). Most users aren't in a position
to debug their charm if it goes wrong, so their only actual interaction is
basically a thoughtless Pavlovian response, the absence of which can leave
an environment needlessly hosed until a human notices it. May as well
automate it for better UX *and* happier outcomes.
2) In any given hook, the ratio of known errors to possible errors is
approximately 0:1 [0]. Those infinitesimally few known errors should indeed
set statuses before failing out (even if you have to look in status history
to see them); but we have to be mindful of the vast majority of cases,
where we have *no idea* what could have gone wrong. And in that case, the
only functional response is to retry -- some unknown errors may be fatal,
but to *assume* that they all are risks locking up the system on every
transient blip.
3) Finally, now that you have the choice, I'd advise against in-hook
retries: (i) the longer you sit in one hook retrying, the longer all
colocated units are blocked [1]; and (ii) delegating the retries to the
infrastructure lets you write much much cleaner code [2].
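
To make point 0 concrete, here is a minimal sketch of a hook written as
"converge to the desired state", so that running it twice leaves things
exactly as running it once would. Everything in it (the state dict,
`ensure_line`, `config_changed_hook`, the `listen_port` setting) is
hypothetical and made up for illustration; it is not Juju's API or any real
charm's code.

```python
# Hypothetical illustration only: none of these names come from Juju
# or from any real charm.

def ensure_line(config_lines, line):
    """Return a copy of config_lines in which `line` appears exactly once."""
    if line in config_lines:
        return list(config_lines)
    return list(config_lines) + [line]


def config_changed_hook(state):
    """Converge `state` to the desired configuration.

    The hook describes the end state rather than a sequence of one-off
    steps, so re-running it after a failure or a retry is harmless.
    """
    state["config"] = ensure_line(state["config"], "listen_port=8080")
    state["service_running"] = True
    return state


initial = {"config": [], "service_running": False}
once = config_changed_hook(dict(initial))
twice = config_changed_hook(dict(once))
```

A hook shaped like this does not care whether it runs once or five times,
which is what makes it safe to hand the retrying to something outside the
hook.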
Are there any concerns that I've missed?
> Specifically I was testing some changes to the odl-controller charm; this
> feature covered up a race in the charm hook code accessing the API of ODL,
> which I failed to notice the first few times I deployed (not paying
> attention due to multi-tasking), and then had me scratching my head as to
> what was going on when I started to notice the hook failure.
>
You say "covered up a race", I say "automatically resolved the problem for
you" :-).
Cheers
William
[0] this applies to any code really, inside or outside juju, it's not
specific to hooks at all.
[1] and while it may not be *common* I'm pretty sure it'd be *possible* for
a hook to deadlock like this; would prefer not to encourage that.
[2] this is also widely applicable: adding retry logic *within* an
idempotent operation is basically always worse than building independent
operation-retrying infrastructure and reusing that where necessary.
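
As a rough sketch of what [2] means, the retry policy below lives in one
reusable wrapper, and the operation itself contains no retry logic at all.
The names (`run_with_retries`, `FlakyAPI`), the attempt count and the backoff
values are all assumptions chosen for the example; this is not how Juju
implements hook retries.

```python
import time

# Hypothetical illustration only: not Juju's retry implementation.

def run_with_retries(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call `operation` until it succeeds or max_attempts is exhausted.

    Waits base_delay * 2**attempt between failed attempts and re-raises
    the last error if every attempt fails.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as err:  # unknown errors are the common case
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


class FlakyAPI:
    """Stand-in for a remote API that is not ready for the first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def configure(self):
        # Plain idempotent operation: no retry logic inside it.
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("API not ready yet")
        return "configured"


api = FlakyAPI(failures=2)
result = run_with_retries(api.configure, sleep=lambda seconds: None)
```

The same wrapper can be reused for any other idempotent operation, and the
backoff policy can be changed in one place without touching the operations.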