[Maas-devel] RFC: "Serialising" power actions

Tue Sep 16 23:43:53 UTC 2014

On Tuesday 16 Sep 2014 11:42:48 Gavin Panella wrote:
> On 15 September 2014 22:34, Graham Binns wrote:
> ...
> 
> > 1: The current power action blocks all others until it has completed. Other
> > power actions will be queued and executed in turn.
> > - or -
> > 2: Each power action supersedes any action that is currently executing —
> > the existing action is cancelled and then the new action is run.
> > - or -
> > 3. We track the current ("now") and "next" actions for the node, but drop
> > every action that comes in once those two slots are full.
> 
> I think #1 is wrong; apart from stress-testing I can't think of a
> situation where I'd want every panic-and-frustration-induced click of
> the power buttons in the UI to be recorded and acted upon. I just want
> it to do the last one.
> 
> #3 is like #2, but you have to wait for the currently executing command
> to finish. Boring!
> 
> I think #2 is the right starting point. In addition:

I disagree.  You'll be back in the scenario where important power ops are 
ignored (which is a bug that I just fixed).

For example, a machine is slow to power down and doesn't finish before a power 
up is issued. MAAS expects the node to have rebooted, but in fact it hasn't 
and at best will sit there not doing anything, at worst will reboot with old 
data.

Effectively you've issued two power commands both of which were ignored.

We *must* wait for an outstanding power op to finish one way or another, 
whether it fails or it succeeds.
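
To make the "wait for it to finish, one way or another" rule concrete, here is a
minimal sketch. It is not MAAS code: it uses asyncio rather than Twisted so that
it is self-contained, and the names PowerSerialiser, run and failed_nodes are
hypothetical. It shows one possible reading, where a later power op for the same
node does not start until the outstanding one has succeeded or failed, and a
failure is recorded against the node.

```python
import asyncio


class PowerSerialiser:
    """Hypothetical helper: at most one power op per node runs at a time."""

    def __init__(self):
        self._locks = {}           # system_id -> asyncio.Lock
        self.failed_nodes = set()  # nodes whose power op raised an error

    async def run(self, system_id, power_op):
        lock = self._locks.setdefault(system_id, asyncio.Lock())
        # A second caller for the same node blocks here until the
        # outstanding op has finished, whether it succeeded or failed.
        async with lock:
            try:
                return await power_op()
            except Exception:
                # A failed power op is recorded as a node failure.
                self.failed_nodes.add(system_id)
                raise


async def demo():
    events = []
    serialiser = PowerSerialiser()

    async def slow_power_off():
        events.append("off started")
        await asyncio.sleep(0.05)  # stands in for a slow power-down
        events.append("off finished")

    async def power_on():
        events.append("on started")

    # The power-on is issued while the power-off is still running.
    await asyncio.gather(
        serialiser.run("node-1", slow_power_off),
        serialiser.run("node-1", power_on),
    )
    return events
```

Running demo() with asyncio.run() shows the power-on starting only after the
slow power-off has finished, which is the ordering the slow-power-down example
above depends on.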

> - If a power-on command is sent to a cluster, and the cluster is already
>   attempting to turn the node on, the command should be silently merged
>   with the existing command.
> 
> - Likewise for power-off.

*Maybe*, but I don't think it should be silent; it should respond with something 
to indicate "ok I am already doing that, please wait".  Whether it's an error 
condition or not is arguable both ways.
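
As a rough illustration of a non-silent reply, the sketch below returns an
explicit status for every request instead of merging duplicates quietly. The
names PowerReply and PowerTracker are made up for this example and are not part
of MAAS; whether ALREADY_IN_PROGRESS should be treated as an error is left to
the caller, since that is the open question.

```python
from enum import Enum


class PowerReply(Enum):
    STARTED = "started"
    ALREADY_IN_PROGRESS = "ok I am already doing that, please wait"
    BUSY = "a different power action is still in progress"


class PowerTracker:
    """Hypothetical per-cluster record of the power action running per node."""

    def __init__(self):
        self._in_progress = {}  # system_id -> "on" or "off"

    def request(self, system_id, action):
        current = self._in_progress.get(system_id)
        if current is None:
            self._in_progress[system_id] = action
            return PowerReply.STARTED
        if current == action:
            # Same action requested again: say so rather than merging silently.
            return PowerReply.ALREADY_IN_PROGRESS
        return PowerReply.BUSY

    def finished(self, system_id):
        # Called when the action completes, whether it succeeded or failed.
        self._in_progress.pop(system_id, None)
```

A second "on" request for a node that is already powering on gets
ALREADY_IN_PROGRESS back, and a conflicting "off" gets BUSY until finished()
has been called for that node.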

> - It may be interesting to include a discriminator from a monotonically
>   increasing sequence with each power command. A power-off command that
>   is received by a cluster with a discriminator lower than a running
>   power-on command would be rejected.
> 
>   In practice I doubt this will make much difference, but it's worth
>   mentioning even if only to reject the idea.

I'm not sure what it achieves tbh, but I'm open to it if you convince me :)
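
For reference, the discriminator idea could look roughly like the sketch below.
It assumes the region hands out numbers from a single monotonically increasing
counter and that the cluster remembers the number of the running command; the
names SequencedPowerState, accept and next_sequence are hypothetical.

```python
import itertools


class SequencedPowerState:
    """Hypothetical cluster-side check of command sequence numbers."""

    def __init__(self):
        self._running = {}  # system_id -> (sequence, action)

    def accept(self, system_id, sequence, action):
        running = self._running.get(system_id)
        if running is not None and sequence < running[0]:
            # Older than the command already running: reject it.
            return False
        self._running[system_id] = (sequence, action)
        return True


# Assumed to live on the region: one counter shared by all power commands.
next_sequence = itertools.count(1).__next__
```

If a power-off is numbered before a power-on but reaches the cluster after it,
accept() returns False for the power-off, so a late-arriving stale command
cannot override a newer one.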

> 
> > At first glance the second option is simpler — just cancel whatever's
> > there and then do our thing. But I think that it's actually a bit
> > deceptive. Consider:
> >  - How do we "cancel" an action?
> 
> Cancelling means keeping a reference to a cancel function in a shared
> location in the cluster. Shared state can make people feel dirty, but
> Twisted's single-thread model makes it pretty benign.

I don't think cancelling is something worth doing, or even possible.  I am yet 
to see any power controllers that allow you to cancel an in-progress power 
operation.

The only time you'd be able to cancel one is if you queued it up, and I already 
explained why I think that's a bad idea.

> Doing this would also allow the region to ask the cluster things like
> "are you changing a node's power state?".

The region should know this already if it has an outstanding power operation 
that awaits a response.

> 
> >  - How do we ensure that we're not going to end up in an inconsistent
> >    state if the node is already responding to action #1 when we cancel it?
> 
> The power control code has been improved to be less fire-and-forget and
> more fire-and-keep-firing-until-it-is-in-the-desired-state. Cancelling a
> power-on and starting a power-off should Just Work.

Again, we must wait for these to complete as they are not cancellable and can 
fail, which must result in a node failure.

J