[Maas-devel] RFC: "Serialising" power actions

Mon Sep 15 21:34:33 UTC 2014

Hi all,

I'm handling the work to "serialise" power actions; at least, I'm getting
started on it right now. I've spent some time looking at the problem and I
wanted to bounce ideas off you all, preferably whilst I sleep :)

So, the problem:

When power actions are issued to a node (power on, power off, etc.), more
than one can be in play for that node at once. We don't keep track of them
once they've been fired, except for receiving a notification when they've
succeeded or failed.

This means that it's possible to issue two conflicting commands (e.g. power
on followed by power off) in quick succession, which can then leave the
node in an odd state: it's theoretically possible that the node would stay
powered on when MAAS expects it to be off, say if for some reason the power
off command got executed first. This is even more likely with AMT BMCs,
since there's a degree of did-I-cast-the-runes-right to get a command to
work on those, at least when the moon is waning and the wind is from the
east.
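
To make the failure mode concrete, here is a toy model of it. This is only
a sketch: the (name, issued_at, delay) tuples and final_power_state() are
made up for illustration and are not MAAS code. The only assumption is that
the node ends up in whatever state the last command to take effect leaves
it in.

```python
# Hypothetical model: each command is (name, issued_at, delay), where
# delay is how long the BMC takes to act on the command.
def final_power_state(commands):
    """Return the state left by whichever command takes effect last."""
    # Order by the time the command takes effect, not the time it was issued.
    by_effect = sorted(commands, key=lambda c: c[1] + c[2])
    return by_effect[-1][0]


commands = [
    ("on", 0.0, 5.0),   # issued first, but the BMC is slow to act on it
    ("off", 0.1, 1.0),  # issued second, takes effect first
]
expected = commands[-1][0]  # what MAAS expects: the last command issued
actual = final_power_state(commands)
print(expected, actual)
```

Here MAAS would expect "off" but the node ends up "on", because nothing
ties the order of execution to the order of issue.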

There are, so far as I can tell, three strategies for handling this problem
properly. All of them require keeping track of the current power action
for a node, and all assume that only one action can run at once:

1. The current power action blocks all others until it has completed. Other
power actions will be queued and executed in turn.
- or -
2. Each power action supersedes any action that is currently executing:
the existing action is cancelled and then the new action is run.
- or -
3. We track the current ("now") and "next" actions for the node, but drop
every action that comes in once those two slots are full.

At first glance the second option is simpler — just cancel whatever's there
and then do our thing. But I think that it's actually a bit deceptive.
Consider:

 - How do we "cancel" an action?
 - How do we ensure that we're not going to end up in an inconsistent state
if the node is already responding to action #1 when we cancel it?

The first option isn't without its problems either: having a queue of
actions seems kind of awkward, and could lead to flip-flopping of a node's
power state. But *not* having a queue could still lead to situations where
several actions get issued in quick succession.

The third option seems to offer a happy medium. We can track the current
and next power actions for a node and then ignore anything else that comes
in whilst both of those two slots are full. Each action must succeed or
fail before the next one can be executed. This means we won't get
potentially ridiculous amounts of flip-flopping, and we could build this
pretty easily. We'd have to have some kind of UI feedback for "hey, it looks like
you're repeatedly powering this node on and off; I'm going to ignore you
for a while," but that doesn't seem all that onerous.
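
A rough sketch of the two-slot idea, to show how little state it needs.
Everything here is hypothetical (the NodePowerSlots class and its submit()
and complete() methods are names I've invented, not existing MAAS code),
and it ignores persistence and locking entirely.

```python
class NodePowerSlots:
    """Track a "now" and a "next" power action for one node (hypothetical)."""

    def __init__(self):
        self.now = None
        self.next = None

    def submit(self, action):
        """Accept an action if a slot is free; return False if it is dropped."""
        if self.now is None:
            self.now = action
            return True
        if self.next is None:
            self.next = action
            return True
        # Both slots are full: drop the action so the UI can tell the user.
        return False

    def complete(self):
        """Call when the current action has succeeded or failed.

        Promotes "next" to "now" and returns the action to run, if any.
        """
        self.now, self.next = self.next, None
        return self.now


slots = NodePowerSlots()
# Four actions in quick succession: the first two are kept, the rest dropped.
results = [slots.submit(a) for a in ["on", "off", "on", "off"]]
first_promoted = slots.complete()   # "on" finished, so "off" becomes current
second_promoted = slots.complete()  # "off" finished, nothing left to run
```

The False return from submit() is where the "I'm going to ignore you for a
while" feedback would hook in.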

So as it stands I'm leaning towards option #3. Questions, thoughts
and comments are welcome.

~gmb