[Maas-devel] RFC: "Serialising" power actions

Julian Edwards julian.edwards at canonical.com
Mon Sep 15 23:40:34 UTC 2014


TL;DR, we should not be serialising at all.  See below.

On Monday 15 Sep 2014 22:34:33 Graham Binns wrote:
> Hi all,
> 
> I'm handling the work to "serialise" power actions — at least, I'm getting
> started on it right now. I've spent some time looking at the problem and I
> wanted to bounce ideas off you all — preferably whilst I sleep :)
> 
> So, the problem:
> 
>  When a power action is issued to a node (power on, power off, etc.), more
> than one can be in play for a node at once. We don't keep track of them
> once they've been fired, except for receiving a notification when they've
> been successful or failed.
> 
> This means that it's possible to issue two conflicting commands (e.g. power
> on followed by power off) in quick succession, which can then leave the
> node in an odd state: it's theoretically possible that the node would stay
> powered on when MAAS expects it to be off, say if for some reason the power
> off command got executed first — this is even more likely with AMT BMCs,
> since there's a degree of did-I-cast-the-runes-right to get a command to
> work on those, at least when the moon is waning and the wind is from the
> east.
> 
> There are, so far as I can tell, three strategies for handling this problem
> properly. Both of them require keeping track of the current power action
> for a node, and both assume that only one action can run at once:
> 
> 1: The current power action blocks all others until it has completed. Other
> power actions will be queued and executed in turn.
> - or -
> 2: Each power action supersedes any action that is currently executing —
> the existing action is cancelled and then the new action is run.
> - or -
> 3: We track the current ("now") and "next" actions for the node, but drop
> every action that comes in once those two slots are full.
> 
> At first glance the second option is simpler — just cancel whatever's there
> and then do our thing. But I think that it's actually a bit deceptive.
> Consider:
> 
>  - How do we "cancel" an action?
>  - How do we ensure that we're not going to end up in an inconsistent state
> if the node is already responding to action #1 when we cancel it?
> 
> The first option isn't without its problems either — having a queue of
> actions seems kind of awkward, and could lead to flip-flopping of a node's
> power state. But *not* having a queue could still lead to situations where
> several actions get issued in quick succession.
> 
> The third option seems to offer a happy medium. We can track the current
> and next power actions for a node and then ignore anything else that comes
> in whilst both of those two slots are full. Each action must succeed or
> fail before the next one can be executed. This means we won't get
> potentially ridiculous amounts of flip-flopping, and we can build this pretty
> easily. We'd have to have some kind of UI feedback for "hey, it looks like
> you're repeatedly powering this node on and off; I'm going to ignore you
> for a while," but that doesn't seem all that onerous.
> 
> So as it stands I'm leaning towards option #3. Questions, thoughts
> and comments are welcome.
> 
> ~gmb
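
A minimal sketch of option #3 as quoted above, assuming an in-memory tracker
per node. The class name PowerSlots, its fields and the string results are
hypothetical and only illustrate the "now"/"next" slots; this is not MAAS
code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PowerSlots:
    """Hypothetical per-node tracker with a 'now' and a 'next' slot."""

    now: Optional[str] = None
    next_: Optional[str] = None

    def submit(self, action: str) -> str:
        """Accept a new request; return 'started', 'queued' or 'dropped'."""
        if self.now is None:
            self.now = action
            return "started"
        if self.next_ is None:
            self.next_ = action
            return "queued"
        # Both slots are full, so the request is ignored.
        return "dropped"

    def finish(self) -> Optional[str]:
        """Record that the running action succeeded or failed.

        Promotes the queued action, if any, and returns it so the caller
        can start it.
        """
        self.now, self.next_ = self.next_, None
        return self.now
```

A caller would invoke submit() for each incoming request and finish() when
the success or failure notification arrives; a "dropped" result is the point
at which the UI feedback mentioned above would be shown.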

Bear in mind that I am about to land a branch that stops nodes going to READY 
until we get an ack from the power_off stuff, so it will stop *some* of the 
tomfoolery.

IMO we need both #1 and a sort of #3 above (which makes #2 moot).

We need a way to *recover* operations in the case of a pserv and/or region 
failure, and to do this the database needs to store the *desired* state of the 
power in addition to its current state.  As I have previously said, the pserv 
needs to issue a "recovery" call to the region when it restarts so it can 
converge on the desired state; for power the region would send back a list of 
outstanding power ops on nodes and the desired state for each.
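
As a rough sketch of that recovery reply, assuming the region keeps a current
and a desired power state per node. PowerRecord, outstanding_power_ops and
the state strings are hypothetical names for illustration, not existing MAAS
APIs.

```python
from typing import Dict, List, NamedTuple, Tuple


class PowerRecord(NamedTuple):
    current: str  # last power state the region knows about
    desired: str  # power state the region wants the node to reach


def outstanding_power_ops(
    records: Dict[str, PowerRecord],
) -> List[Tuple[str, str]]:
    """Return (system_id, desired_state) for nodes that have not converged."""
    return [
        (system_id, record.desired)
        for system_id, record in sorted(records.items())
        if record.current != record.desired
    ]
```

On restart the pserv would ask for this list and re-issue whatever power
command moves each listed node towards its desired state.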

If we're careful about node state management, we can avoid issuing more than 
one power command at a time by refusing a new one unless:
 * the previous power op has completed
 * the node is in a state that allows the op
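
A sketch of that check, under stated assumptions: the status names and the
ALLOWED_OPS table below are made up for illustration and do not reflect the
real node state machine.

```python
# Hypothetical mapping of node status to the power ops it permits.
ALLOWED_OPS = {
    "ALLOCATED": {"on", "off"},
    "COMMISSIONING": {"on"},
    "RELEASING": {"off"},
}


def may_issue(node_status: str, op: str, op_in_progress: bool) -> bool:
    """Return True only if no power op is pending and the status allows op."""
    if op_in_progress:
        # A previous power op has not yet reported success or failure.
        return False
    return op in ALLOWED_OPS.get(node_status, set())
```

With a gate like this there is never more than one power op in flight for a
node, so there is nothing to queue or cancel.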

Let's discuss this when you start today.



