Machine agents uninstall themselves upon worker.ErrTerminateAgent.

Andrew Wilkins andrew.wilkins at canonical.com
Mon May 9 07:56:52 UTC 2016


On Mon, May 9, 2016 at 2:28 PM William Reade <william.reade at canonical.com>
wrote:

> On Mon, May 9, 2016 at 3:28 AM, Andrew Wilkins <
> andrew.wilkins at canonical.com> wrote:
>
>> On Sat, May 7, 2016 at 1:37 AM William Reade <william.reade at canonical.com>
>> wrote:
>>
>>> On Fri, May 6, 2016 at 5:50 PM, Eric Snow <eric.snow at canonical.com>
>>> wrote:
>>>
>>>> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>>>
>>>
>> So I think this issue is fixed in 2.0, but looks like the changes never
>> got backported to 1.25. From your options, we do have (the opposite of) a
>> DO_NOT_UNINSTALL file (it's actually called
>> "/var/lib/juju/uninstall-agent"; only if it exists do we uninstall).
>>
>> (And now that I think of it, we're only writing uninstall-agent for the
>> manual provider's bootstrap machine, and not other manual machines, so
>> we're currently leaving Juju bits behind on manual machines added to an
>> environment.)
>>
>
> Except we're *also* writing it on every machine, for Very Bad Reasons,
> right? So we *are* still cleaning up all machines, but there's a latent
> manual provider bug that'll need addressing.
>

Yes, sorry, it does appear that we're doing it on all machines. Disregard
my parenthetical remark. And yes, we should really only write that file for
manual machines.

But... I've just looked at the 1.25 branch again, and the fix *was* made
there. And from Jorge's comment
https://bugs.launchpad.net/juju-core/+bug/1514874/comments/4, we can see
that the uninstall logic isn't actually running (see `uninstall file
"/var/lib/juju/uninstall-agent" does not exist`
https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1741)

I'm not sure what to make of that. Eric, have you confirmed that that code
is what's causing the issue? Are we sure we're not barking up the wrong
tree?

>> The reason it's done at the last moment is to avoid having dangling
>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
>> remove systemd/upstart), then if the agent fails before we get to
>> EnsureDead, then the entity will never be removed from state.
>>
>
> The *only* thing that should happen after setting dead is the uninstall --
> anything else that's required to happen before cleanup *must* happen before
> setting dead, which *means* "all my responsibilities are 100% fulfilled".
>

I don't think I suggested above that we should do anything else other than
uninstall?

> The *only* justification for the post-death logic in the manual case is
> because there's no responsible provisioner component to hand over to -- and
> frankly I wish we'd just written that to SSH in and clean up, instead of
> taking on this ongoing hassle.
>

>
>> As an alternative, we could (should) only ever write the
>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
>> there's no assigned units, and no storage attached.
>>
>
> Why would we *ever* want to write it at runtime? We know if it's a manual
> machine at provisioning time, so we can write the File Of Death OAOO. All
> the other mucking about with it is the source of these (serious!) bugs.
>

The point is not to distinguish between manual vs. non-manual. Yes, we can
write something that records that fact OAOO.

The point of "write from the machiner" was to signal that the machine is
actually dead, and removed from state, vs. "my credentials are invalid,
better shut down now".

So we can write a file to confine uninstall to manual machines -- that much
is easy, I don't think anyone will disagree with doing that. But we should
not ignore the bug that prompted this thread, even if it's confined to
manual machines.

>>> Andrew, I think you had more detail last time we discussed this: is there
>>> anything else in uninstall (besides loop-device stuff) that needs to run
>>> *anywhere* except a manual machine? and, what will we actually need to sync
>>> with in the machiner? (or, do you have alternative ideas?)
>>>
>>
>> No, I don't think there is anything else to be done in uninstall, apart
>> from loop detach and manual machine cleanup. I'm not sure about moving the
>> uninstall logic to the machiner, for reasons described above. We could
>> improve the current state of affairs, though, by only writing the
>> uninstall-agent file from the machiner
>>
>
> Strong -1 on moving uninstall logic: if it has to happen (which it does,
> in *rare* cases that are *always* detectable pre-provisioning), uninstall
> is where it should happen, post-machine-death; and also strong -1 on
> writing uninstall-agent in *any* circumstances except manual machine
> provisioning, we have had *way* too many problems with this "clever"
> feature being invoked when it shouldn't be.
>

I don't want to belabour the point, but just to be clear: the
uninstall-agent file exists to record the fact that the machine is in fact
Dead, and uninstall should go ahead. That logic was put in specifically to
prevent the referenced bug. We can and should improve it to only do this
for manual machines.

>> FWIW, the loop stuff can be dropped when the LXC container support is
>> removed. Nobody ever added support for loop in the LXD provider, and I
>> think we should implement support for it differently to how it was done for
>> LXC anyway (losetup on host, expose to container; as opposed to expose all
>> loop devices to all LXD containers and losetup in container).
>>
>
> +1000 to that. So... can't we just (1) fix the manual provisioning to
> write the file; (2) drop all other use of uninstall-agent; (3) drop the
> lxc-specific logic in uninstall -- and then we're done?
>

For first steps, I think so. But I do think we should fix
https://bugs.launchpad.net/juju-core/+bug/1514874 for manual machines as
well. So:

 (1) record (in agent config?) that a machine is manual
 (2) only ever do anything uninstall-related for manual machines
 (3) only ever do uninstall-related things if the machine actually is Dead
 (4) drop lxc-specific logic from uninstall *when LXC support is removed*
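
Points (2) and (3) together reduce to a single guard. A minimal sketch,
assuming the "manual" fact from (1) is available somewhere the agent can read
it; the machineInfo type, its fields and shouldUninstall are hypothetical
names, not anything in the Juju code base:

```go
package main

import "fmt"

// Life mirrors the lifecycle states referred to in the thread.
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// machineInfo is a hypothetical record used only for this sketch.
type machineInfo struct {
	Manual bool // recorded once, at provisioning time (point 1)
	Life   Life // lifecycle state as confirmed against state
}

// shouldUninstall applies points (2) and (3): uninstall only on manual
// machines, and only once the machine is actually Dead. Invalid
// credentials alone are deliberately not a reason to uninstall.
func shouldUninstall(m machineInfo) bool {
	return m.Manual && m.Life == Dead
}

func main() {
	cases := []machineInfo{
		{Manual: false, Life: Dead},
		{Manual: true, Life: Dying},
		{Manual: true, Life: Dead},
	}
	for _, c := range cases {
		fmt.Printf("manual=%v life=%d uninstall=%v\n", c.Manual, c.Life, shouldUninstall(c))
	}
}
```

The open question is still where the "actually Dead" signal comes from; the
sketch just takes it as an input.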

Cheers,
Andrew