[Maas-devel] RFC: "Serialising" power actions

Thu Sep 18 12:33:00 UTC 2014

On 18 September 2014 00:27, Julian Edwards <julian.edwards at canonical.com> wrote:
> On Wednesday 17 Sep 2014 09:58:47 Gavin Panella wrote:
...
> I already explained a few times why it's undesirable.  I'm not being funny,
> but do you have anything concrete to refute my points other than "I think it's
> undesirable" ?
>
>>
>> >  * you cannot rely on cancellation of an outstanding operation (in
>> >    what state would it leave the machine?)
>>
>> Only cancel an in-progress task when there's something to supersede it,
>> and when the final desired state of the superseder is different to the
>> in-progress task.
>
> As I explained in the previous email, cancelling can miss important state
> changes that need to happen and you have no idea if it really cancelled or
> not.

I did miss that power-on is important because it also means booting,
which is especially important for a transitional status like DEPLOYING.
But I think it's okay to say that, if we want the power off, we want it
off nowish, and we're happy to override whatever is in progress to get
there.
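
As a rough sketch of the rule I'm proposing (the function name and the
"on"/"off" strings are made up for illustration; this is not existing
MAAS code):

```python
from typing import Optional


def should_cancel(in_progress: Optional[str], requested: str) -> bool:
    """Decide whether an in-progress power action should be cancelled.

    `in_progress` is the final state the running action is heading for
    ("on" or "off"), or None if nothing is running. `requested` is the
    final state wanted by the new, superseding action.
    """
    if in_progress is None:
        # Nothing is running, so there is nothing to cancel.
        return False
    # Only cancel when the superseder wants a different final state.
    return in_progress != requested
```

So a power-off arriving while a power-on is in progress would cancel it,
while a second power-on would simply let the first one finish.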

...
>> For DEPLOYED nodes, sure, the command will currently be lost, but
>> these nodes are, one assumes, under active management, and some
>> process outside of MAAS will notice, be that a human or a Juju or
>> something else.
>
> I think that's a dreadful user experience. We should not be knowingly
> throwing away user requests.

It's not perfect, but it's far from dreadful.

MAAS can never provide a perfect service. Even if we coded everything
perfectly, it would still suffer from failures outside of its control,
like hardware failure, poor quality BMC firmware, human error, fire,
flood, war, and so on. Not resuming power commands after a crash or
restart of a cluster is a failure mode that we can iterate on and
reduce, but it's not the end of days.

...
>> We can infer the desired power state from statuses:
>>
>>     NEW = off
>>     COMMISSIONING = on
>>     FAILED_COMMISSIONING = off
>>     MISSING = ? (unused status, afaik)
>>     READY = off
>>     RESERVED = off
>>     ALLOCATED = off
>>     RETIRED = ? (unused status, afaik)
>>     BROKEN = off
>>     DEPLOYING = on
>>     DEPLOYED = not our business
>>     FAILED_DEPLOYMENT = off
>
> DEPLOYED really is our business though.
>
> You cannot infer desired power state from the statuses in a sane
> manner, it's overloading the meaning of status. For example, some of
> the states can arise through failures and with your scheme it implies
> we need to turn them off, which is not always going to be desirable
> (what if someone needs to do live debugging?)
>
> IOW, why imply it when you can be explicit about it.

We can implement desired-power-state as a direct inference from status
or by using additional intermediate steps; it doesn't really matter. My
point is that the desired power state is quite strongly related to the
node's place in its lifecycle, to the point where we don't need to store
a separate desired-power-state.
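
To make the direct-inference option concrete, here's a minimal sketch
using the table quoted above. The dict, the function name and the use of
plain strings for statuses are assumptions for illustration, not the
actual MAAS model; None stands for "no opinion" (the "?" entries and
DEPLOYED):

```python
from typing import Optional

# Desired power state per node status, as listed earlier in the thread.
# None means we do not infer anything for that status.
DESIRED_POWER_STATE = {
    "NEW": "off",
    "COMMISSIONING": "on",
    "FAILED_COMMISSIONING": "off",
    "MISSING": None,  # unused status, afaik
    "READY": "off",
    "RESERVED": "off",
    "ALLOCATED": "off",
    "RETIRED": None,  # unused status, afaik
    "BROKEN": "off",
    "DEPLOYING": "on",
    "DEPLOYED": None,  # not our business
    "FAILED_DEPLOYMENT": "off",
}


def desired_power_state(status: str) -> Optional[str]:
    """Return "on", "off", or None when no state is inferred."""
    return DESIRED_POWER_STATE.get(status)
```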

However, you think that we should be in control of power state when a
node is deployed, and I can see we'd need a field for that, but I think
the consensus so far is not in favour of that.

For BROKEN nodes, where we want to allow debugging, we can say something
like "power-off at first, then expect it to be off, but don't enforce".
Whereas a machine in the READY state would be "expect it to be off, and
enforce".
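
Something like the following is what I have in mind for the
expect-versus-enforce distinction. It's only a sketch: PowerPolicy, its
fields and corrective_action are hypothetical names, and only the two
statuses discussed here are filled in:

```python
from typing import NamedTuple, Optional


class PowerPolicy(NamedTuple):
    on_entry: Optional[str]  # action to issue when the status is entered
    expected: str            # power state we expect afterwards
    enforce: bool            # whether to correct a mismatch


POLICIES = {
    # Power off at first, then expect off, but don't enforce, so that
    # someone can power the node back on for live debugging.
    "BROKEN": PowerPolicy(on_entry="off", expected="off", enforce=False),
    # Expect off, and enforce.
    "READY": PowerPolicy(on_entry=None, expected="off", enforce=True),
}


def corrective_action(status: str, observed: str) -> Optional[str]:
    """Return the power state to drive the node to, or None to leave it."""
    policy = POLICIES.get(status)
    if policy is None or observed == policy.expected:
        return None
    return policy.expected if policy.enforce else None
```

With this, a READY node found powered on would be switched off, while a
BROKEN node found powered on would be left alone.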