<div dir="ltr">On Fri, May 6, 2016 at 5:50 PM, Eric Snow <span dir="ltr"><<a href="mailto:eric.snow@canonical.com" target="_blank">eric.snow@canonical.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">See <a href="https://bugs.launchpad.net/juju-core/+bug/1514874" rel="noreferrer" target="_blank">https://bugs.launchpad.net/juju-core/+bug/1514874</a>.<br></blockquote><div><br></div><div>Mainly, we just shouldn't do it. The only *actual reason* for us to do this is to support the manual provider; any other machine that happens to be dead will be cleaned up by the responsible provisioner in good time. There's a file we write to enable the suicidal behaviour when we enlist a manual machine, and we *shouldn't* have ever written it for any other reason.<br><br></div><div>But then people started adding post-death cleanup logic to the agent, and the only way to trigger that was to write that file; so they started writing that file on normal shutdown paths, so that they could trigger the post-death logic in all cases, and I can only assume they decided that nuking the installation was acceptable precisely *because* the provisioner would be taking down the machine anyway. And note that because there *is* a responsible provisioner that will be taking the machine down, that shutdown logic might or might not happen, rendering it ultimately pretty unreliable [0] as well as spectacularly terrible in the lp:1514874 cases. /sigh<br></div><div><br></div><div>So, narrowly, fixing this involves relocating the more-widely-useful (in-container loop device detach IIRC?) code inside MachineAgent.uninstall; and then just not ever writing the file that triggers actual uninstall, putting it back to where it was intended: a nasty but constrained hack to avoid polluting manual machines (which are only "borrowed") any more than necessary.<br></div><div> <br></div><div>(A bit more broadly: ErrTerminateAgent is kinda dumb, but not *particularly* harmful in practice [1] without the file of death backing it up. The various ways in which, e.g. api connection failure can trigger it are a bit enthusiastic, perhaps, but if *all it did was terminate the agent* it'd be fine: someone who, e.g. changed their agent.conf's password would invoke jujud and see it stop -- can't do anything with this config -- but be perfectly able to work again if the conf were fixed. Don't know what the deal is with the correct-password-refused thing, though: that should be investigated further.)<br><br></div><div>The tricky bit is relocating the cleanup code in uninstall -- I suspect the reason this whole sorry saga kicked off is because whoever added it was in a hurry and thought they didn't have a place to put this logic (inside the machiner, *before* setting Dead, would be closest to correct -- but that'd require us to synchronise with any other workers that might use the resources we want to clean up, and I suspect that would have been the roadblock). Andrew, I think you had more detail last time we discussed this: is there anything else in uninstall (besides loop-device stuff) that needs to run *anywhere* except a manual machine? and, what will we actually need to sync with in the machiner? (or, do you have alternative ideas?)<br></div><br></div><div class="gmail_quote">Cheers<br></div><div class="gmail_quote">William<br></div><div class="gmail_quote"><div><br></div><div>[0] although with the watcher resolution of 5s, it's actually got a pretty good chance of running.<br></div><div>[1] still utterly terrible in theory, it would be lovely to see explicit handover between contexts -- e.g. it is *not* the machiner's job to return ErrTerminateAgent, it is whatever's-responsible-for-the-machiner's job to handle a local ErrMachineDead and convert that as necessary for the context it's running in.<br><br></div><div>PS Oh, uh, couple of comments on the below:<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> 1. do nothing<br> This would be easy :) but does not help with the pain point.<br> 2. be smarter about retrying (e.g. retry connecting to API) when<br> running into fatal errors<br> This would probably be good to do but the effort might not pay for itself.<br> 3. do not clean up (data dir, init service, or either)<br> Leaving the init service installed really isn't an option because<br> we don't want<br> the init service to try restarting the agent over and over.<br> Leaving the data dir<br> in place isn't good because it will conflict with any new agent<br> dir the controller<br> tries to put on the instance in the future.<br></blockquote><div><br></div><div>we don't need to worry about the init service because we're set up not to be restarted if we return 0, which we should on ErrTerminateAgent.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> 4. add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the<br> machine/model config or as a sentinel file on instances) and do not<br> clean up if it is set to true (default?)<br> This would provide a reasonable quick fix for the above bug, even if<br> temporary and even if it defaults to false.<br> 5. disable (instead of uninstall) the init services and move the data<br> dir to some obvious but out-of-the-way place (e.g.<br> /home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of<br> deleting it<br> This is a reasonable longer term solution as the concerns<br> described for 3 are<br> addressed and manual intervention becomes more feasible.<br> 6. same as 5 but also provide an "enable-juju" (or "revive-juju",<br> "repair-juju", etc.) command *on the machine* that would re-enable the<br> init services and restore (or rebuild) the agent's data dir<br> This would make it even easier to manually recover.<br> 7. first try to automatically recover (from machine or controller)<br> This would require a sizable effort for a problem that shouldn't<br> normally happen.<br> 8. do what the unit agent does<br> I haven't looked closely enough to see if this is a good fit.<br> <br> I'd consider 4 with a default of false to be an acceptable quick fix.<br> Additionally, I'll advocate for 6 (or 5) as the most appropriate<br> solution in support of manual recovery from "fatal" errors.<br></blockquote><div><br></div><div>I don't think we want any of this -- it's all a consequence of inappropriately triggering the uninstall stuff in the first place.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> -eric<br> <br> <br> * Could this lead to collisions if the instance is re-purposed as a<br> different machine? I suppose it could also expose sensitive data when<br> likewise re-purposed, since it won't necessarily be in the same model<br> or controller. However, given the need for admin access that probably<br> isn't a likely problem.<br></blockquote><div><br></div><div>If substrates are leaking data between separate instances, the substrate is screwed up, and there's not much we can do about it. The manual provider has no such guarantees, and that's the only justification for trying to clean up so drastically in the first place.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <span class="HOEnZb"><font color="#888888"><br> --<br> Juju-dev mailing list<br> <a href="mailto:Juju-dev@lists.ubuntu.com">Juju-dev@lists.ubuntu.com</a><br> Modify settings or unsubscribe at: <a href="https://lists.ubuntu.com/mailman/listinfo/juju-dev" rel="noreferrer" target="_blank">https://lists.ubuntu.com/mailman/listinfo/juju-dev</a><br> </font></span></blockquote></div><br></div></div>