Machine agents uninstall themselves upon worker.ErrTerminateAgent.

Mon May 9 09:03:46 UTC 2016

On Mon, May 9, 2016 at 4:38 PM William Reade <william.reade at canonical.com>
wrote:

> On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkins <
> andrew.wilkins at canonical.com> wrote:
>
>> The reason it's done at the last moment is to avoid having dangling
>>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
>>> remove systemd/upstart), then if the agent fails before we get to
>>> EnsureDead, then the entity will never be removed from state.
>>>
>>
>> The *only* thing that should happen after setting dead is the uninstall
>> -- anything else that's required to happen before cleanup *must* happen
>> before setting dead, which *means* "all my responsibilities are 100%
>> fulfilled".
>>
>
> > I don't think I suggested above that we should do anything else other
> than uninstall?
>
> Sorry I misunderstood -- I was focused on the loop-device stuff that had
> grown in the uninstall logic.
>
>
>> The *only* justification for the post-death logic in the manual case is
>>> because there's no responsible provisioner component to hand over to -- and
>>> frankly I wish we'd just written that to SSH in and clean up, instead of
>>> taking on this ongoing hassle.
>>>
>>
>>>
>> As an alternative, we could (should) only ever write the
>>>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
>>>> there's no assigned units, and no storage attached.
>>>>
>>>
>>> Why would we *ever* want to write it at runtime? We know if it's a
>>> manual machine at provisioning time, so we can write the File Of Death
>>> OAOO. All the other mucking about with it is the source of these (serious!)
>>> bugs.
>>>
>>
>> The point is not to distinguish between manual vs. non-manual. Yes, we
>> can write something that records that fact OAOO.
>>
>> The point of "write from the machiner" was to signal that the machine is
>> actually dead, and removed from state, vs. "my credentials are invalid,
>> better shut down now".
>>
>
> Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on
> mere invalid-credentials. (No worker should return ErrTerminateAgent *at
> all*, really -- not their job. Having a couple of different workers that
> can return ErrAgentEntityDead, which can then be interpreted by the next
> layer? Fine :).)
>
>
>> So we can write a file to confine uninstall to manual machines -- that
>> much is easy, I don't think anyone will disagree with doing that. But we
>> should not ignore the bug that prompted this thread, even if it's confined
>> to manual machines.
>>
>
> Right; and, yeah, it's almost certainly better to leak manual machines
> than it is to uninstall them accidentally -- so long as we know that's the
> tradeoff we're making ;).
>
>
>> Andrew, I think you had more detail last time we discussed this: is there
>>>>> anything else in uninstall (besides loop-device stuff) that needs to run
>>>>> *anywhere* except a manual machine? and, what will we actually need to sync
>>>>> with in the machiner? (or, do you have alternative ideas?)
>>>>>
>>>>
>>>> No, I don't think there is anything else to be done in uninstall, apart
>>>> from loop detach and manual machine cleanup. I'm not sure about moving the
>>>> uninstall logic to the machiner, for reasons described above. We could
>>>> improve the current state of affairs, though, by only writing the
>>>> uninstall-agent file from the machiner
>>>>
>>>
>>> Strong -1 on moving uninstall logic: if it has to happen (which it does,
>>> in *rare* cases that are *always* detectable pre-provisioning), uninstall
>>> is where it should happen, post-machine-death; and also strong -1 on
>>> writing uninstall-agent in *any* circumstances except manual machine
>>> provisioning, we have had *way* too many problems with this "clever"
>>> feature being invoked when it shouldn't be.
>>>
>>
>> I don't want to belabour the point, but just to be clear: the
>> uninstall-agent file exists to record the fact that he machine is in fact
>> Dead, and uninstall should go ahead. That logic was put in specifically to
>> prevent the referenced bug. We can and should improve it to only do this
>> for manual machines.
>>
>
> Meta: a name like "uninstall-agent" is really misleading if it actually
> means "machine-X is definitely dead". I never had the slightest indication
> that was what it was meant to mean. But in that case, why *don't* we write
> the file on certain API connection failures? There is logic that
> *explicitly* checks if the machine is Dead -- surely that's a trigger too?
>

We do, but it looks like we've regressed since the MADE branch landed.
api.ErrConnectImpossible now causes uninstall-agent, which means a
(possibly temporary) authorization error will once again cause an uninstall:

https://github.com/juju/juju/blob/master/worker/apicaller/connect.go#L176

FWIW, the loop stuff can be dropped when the LXC container support is
>>>> removed. Nobody ever added support for loop in the LXD provider, and I
>>>> think we should implement support for it differently to how it was done for
>>>> LXC anyway (losetup on host, expose to container; as opposed to expose all
>>>> loop devices to all LXD containers and losetup in container).
>>>>
>>>
>>> +1000 to that. So... can't we just (1) fix the manual provisioning to
>>> write the file; (2) drop all other use of uninstall-agent; (3) drop the
>>> lxc-specific logic in uninstall -- and then we're done?
>>>
>>
>> For first steps, I think so. But I do think we should fix
>> https://bugs.launchpad.net/juju-core/+bug/1514874 for manual machines as
>> well. So:
>>
>>  (1) record (in agent config?) that a machine is manual
>>  (2) only ever do anything uninstall-related for manual machines
>>  (3) only ever do uninstall-related things if the machine actually is Dead
>>  (4) drop lxc-specific logic from uninstall *when LXC support is removed*
>>
>
> Generally SGTM, but confused re (4) -- doesn't that have to be
> fixed/moved/removed first earlier, so that we can switch off the suicide
> behaviour in other cases without regressing?
>

We can remove the LXC/loop bits now if we're fine with leaking loop
devices, which is probably OK assuming we are actually removing LXC before
2.0. Loop devices has to be explicitly enabled anyway.

Cheers,
Andrew
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20160509/6ccba429/attachment.html>