Machine agents uninstall themselves upon worker.ErrTerminateAgent.

Fri May 6 17:37:53 UTC 2016

On Fri, May 6, 2016 at 5:50 PM, Eric Snow <eric.snow at canonical.com> wrote:

> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>

Mainly, we just shouldn't do it. The only *actual reason* for us to do this
is to support the manual provider; any other machine that happens to be
dead will be cleaned up by the responsible provisioner in good time.
There's a file we write to enable the suicidal behaviour when we enlist a
manual machine, and we *shouldn't* have ever written it for any other
reason.

But then people started adding post-death cleanup logic to the agent, and
the only way to trigger that was to write that file; so they started
writing that file on normal shutdown paths, so that they could trigger the
post-death logic in all cases, and I can only assume they decided that
nuking the installation was acceptable precisely *because* the provisioner
would be taking down the machine anyway. And note that because there *is* a
responsible provisioner that will be taking the machine down, that shutdown
logic might or might not happen, rendering it ultimately pretty unreliable
[0] as well as spectacularly terrible in the lp:1514874 cases. /sigh

So, narrowly, fixing this involves relocating the more-widely-useful
(in-container loop device detach IIRC?) code inside MachineAgent.uninstall;
and then just not ever writing the file that triggers actual uninstall,
putting it back to where it was intended: a nasty but constrained hack to
avoid polluting manual machines (which are only "borrowed") any more than
necessary.

(A bit more broadly: ErrTerminateAgent is kinda dumb, but not
*particularly* harmful in practice [1] without the file of death backing it
up. The various ways in which, e.g. api connection failure can trigger it
are a bit enthusiastic, perhaps, but if *all it did was terminate the
agent* it'd be fine: someone who, e.g. changed their agent.conf's password
would invoke jujud and see it stop -- can't do anything with this config --
but be perfectly able to work again if the conf were fixed. Don't know what
the deal is with the correct-password-refused thing, though: that should be
investigated further.)

The tricky bit is relocating the cleanup code in uninstall -- I suspect the
reason this whole sorry saga kicked off is because whoever added it was in
a hurry and thought they didn't have a place to put this logic (inside the
machiner, *before* setting Dead, would be closest to correct -- but that'd
require us to synchronise with any other workers that might use the
resources we want to clean up, and I suspect that would have been the
roadblock). Andrew, I think you had more detail last time we discussed
this: is there anything else in uninstall (besides loop-device stuff) that
needs to run *anywhere* except a manual machine? and, what will we actually
need to sync with in the machiner? (or, do you have alternative ideas?)

Cheers
William

[0] although with the watcher resolution of 5s, it's actually got a pretty
good chance of running.
[1] still utterly terrible in theory, it would be lovely to see explicit
handover between contexts -- e.g. it is *not* the machiner's job to return
ErrTerminateAgent, it is whatever's-responsible-for-the-machiner's job to
handle a local ErrMachineDead and convert that as necessary for the context
it's running in.

PS Oh, uh, couple of comments on the below:

1. do nothing
>     This would be easy :) but does not help with the pain point.
> 2. be smarter about retrying (e.g. retry connecting to API) when
> running into fatal errors
>     This would probably be good to do but the effort might not pay for
> itself.
> 3. do not clean up (data dir, init service, or either)
>     Leaving the init service installed really isn't an option because
> we don't want
>     the init service to try restarting the agent over and over.
> Leaving the data dir
>     in place isn't good because it will conflict with any new agent
> dir the controller
>     tries to put on the instance in the future.
>

we don't need to worry about the init service because we're set up not to
be restarted if we return 0, which we should on ErrTerminateAgent.

> 4. add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the
> machine/model config or as a sentinel file on instances) and do not
> clean up if it is set to true (default?)
>     This would provide a reasonable quick fix for the above bug, even if
>     temporary and even if it defaults to false.
> 5. disable (instead of uninstall) the init services and move the data
> dir to some obvious but out-of-the-way place (e.g.
> /home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of
> deleting it
>     This is a reasonable longer term solution as the concerns
> described for 3 are
>     addressed and manual intervention becomes more feasible.
> 6. same as 5 but also provide an "enable-juju" (or "revive-juju",
> "repair-juju", etc.) command *on the machine* that would re-enable the
> init services and restore (or rebuild) the agent's data dir
>     This would make it even easier to manually recover.
> 7. first try to automatically recover (from machine or controller)
>     This would require a sizable effort for a problem that shouldn't
> normally happen.
> 8. do what the unit agent does
>     I haven't looked closely enough to see if this is a good fit.
>
> I'd consider 4 with a default of false to be an acceptable quick fix.
> Additionally, I'll advocate for 6 (or 5) as the most appropriate
> solution in support of manual recovery from "fatal" errors.
>

I don't think we want any of this -- it's all a consequence of
inappropriately triggering the uninstall stuff in the first place.

>
> -eric
>
>
> * Could this lead to collisions if the instance is re-purposed as a
> different machine?  I suppose it could also expose sensitive data when
> likewise re-purposed, since it won't necessarily be in the same model
> or controller.  However, given the need for admin access that probably
> isn't a likely problem.
>

If substrates are leaking data between separate instances, the substrate is
screwed up, and there's not much we can do about it. The manual provider
has no such guarantees, and that's the only justification for trying to
clean up so drastically in the first place.

>
> --
> Juju-dev mailing list
> Juju-dev at lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20160506/d069cf4b/attachment.html>