[Maas-devel] RFC: "Serialising" power actions

Graham Binns graham.binns at canonical.com
Tue Sep 16 20:01:57 UTC 2014


On Tuesday, September 16, 2014, Julian Edwards <julian.edwards at canonical.com>
wrote:

> TL;DR, we should not be serialising at all.  See below.
>
> Bear in mind that I am about to land a branch that stops nodes going to READY
> until we get an ack from the power_off stuff, so it will stop *some* of the
> tomfoolery.
>
> IMO we need both #1 and a sort of #3 above (which makes #2 moot).
>
> We need a way to *recover* operations in the case of a pserv and/or region
> failure, and to do this the database needs to store the *desired* state of
> the power in addition to its current state.  As I have previously said, the
> pserv needs to issue a "recovery" call to the region when it restarts so it
> can converge on the desired state; for power the region would send back a
> list of outstanding power ops on nodes and the desired state for each.
>
> If we're careful about node state management, we can prevent issuing more
> than one power command at a time by refusing a new one unless:
>  * the previous power op has completed
>  * the node is in a state that allows the op
>
> Let's discuss this when you start today.


So, we didn't get to talk about this in detail this morning, but after
Gavin's email here's how I see the lay of the land:

 - We definitely want the cluster recovery feature
 - We also want what I'm going to call, for want of a better word, atomic
power actions (which is in fact a subset of the work to make recovery just
work)
 - Recovery is more work than we can realistically do for 1.7
 - That said, Gavin's solution for atomic power actions is not *terribly*
expensive (see below for a summary of that) and we can do it in time for
1.7 ("we" here is me + some help from the Gavinator as needed).

I spent a day poking around and thinking about how to do recovery, and I
*definitely* want to see it done (and I'd love to take the lead on it too,
whilst we're on the subject), but I don't think we need to do all of it
right now to get to a place where power actions are far more reliable.


Gavin's solution
================

 - Power actions are implemented as classes
 - Each power action implements a cancel() method
 - We keep a registry in PServ of the power actions in play or waiting for
a given node
 - When a new power action is issued for a node it checks with the other
power actions in the registry for that node:
   - If it is the same as an action currently in the registry, it discards
itself; this is a way of ensuring that we don't end up with a long queue of
flip-floppy actions.
   - If it supersedes the action currently in play* it calls the cancel()
method on that action and waits for it to be cancelled before running
(since power actions' cancel() methods will probably be decorated
with @inlineCallbacks this can simply be a case of the new action adding
itself as a callback).
   - If it needs to wait for the action in play to finish** it does so,
again probably by adding itself as a callback somewhere.

If we implement this we get:

 - One power action at a time per node
 - Cancellation of superseded actions
 - Coalescing of identical actions (no more flip-flopping)
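
To make the rules above concrete, here is a rough sketch of how the
registry might behave. It is not MAAS code: PowerAction, Registry, RANK and
submit() are all hypothetical names, and it uses asyncio from the standard
library in place of Twisted's @inlineCallbacks purely so that it runs on its
own. It assumes the "power off > power on" ordering from the footnote, and
it only tracks one action per node, serialising submissions with a per-node
lock.

```python
import asyncio
from collections import defaultdict

# Assumed ordering: a higher rank supersedes a lower one (off > on).
RANK = {"on": 0, "off": 1}


class PowerAction:
    """One power change for one node (hypothetical, not the MAAS class)."""

    def __init__(self, node, kind, duration=0.01):
        self.node, self.kind, self.duration = node, kind, duration
        self.task = None

    async def run(self, log):
        # Stand-in for talking to the node's power controller.
        await asyncio.sleep(self.duration)
        log.append((self.node, self.kind))

    async def cancel(self):
        # Request cancellation, then wait until the task has really stopped.
        self.task.cancel()
        await asyncio.gather(self.task, return_exceptions=True)


class Registry:
    """Tracks the action in play for each node."""

    def __init__(self):
        self.in_play = {}                        # node -> PowerAction
        self.locks = defaultdict(asyncio.Lock)   # one lock per node
        self.log = []                            # actions that completed

    async def submit(self, action):
        async with self.locks[action.node]:
            current = self.in_play.get(action.node)
            outcome = "started"
            if current is not None and not current.task.done():
                if current.kind == action.kind:
                    return "discarded"           # identical: coalesce
                if RANK[action.kind] > RANK[current.kind]:
                    await current.cancel()       # supersede the one in play
                    outcome = "superseded"
                else:
                    # Wait for the action in play to finish first.
                    await asyncio.gather(current.task, return_exceptions=True)
                    outcome = "waited"
            action.task = asyncio.create_task(action.run(self.log))
            self.in_play[action.node] = action
            return outcome


async def demo():
    reg = Registry()
    results = [
        await reg.submit(PowerAction("node-1", "on")),
        await reg.submit(PowerAction("node-1", "on")),
        await reg.submit(PowerAction("node-1", "off")),
        await reg.submit(PowerAction("node-1", "on")),
    ]
    await reg.in_play["node-1"].task
    return results, reg.log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the second "on" is discarded as a duplicate, the "off" cancels
the first "on" before it completes, and the final "on" waits for the "off"
to finish, so only the "off" and the last "on" ever reach the node.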

I've had a good look at the code this evening and I'm happy to start
hacking on this in the morning unless there are strong objections — Word of
God appreciated, please.

*It may be that we want all actions to supersede what's in play, but for my
purposes here I'm saying that power off > power on.
**I had a really good idea as to why this was necessary when I started
writing this email, but I can't remember it now.