<br><br>On Tuesday, September 16, 2014, Julian Edwards <<a href="mailto:julian.edwards@canonical.com">julian.edwards@canonical.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">TL;DR, we should not be serialising at all. See below.<br> <br> Bear in mind that I am about to land a branch that stops nodes going to READY<br> until we get an ack from the power_off stuff, so it will stop *some* of the<br> tomfoolery.<br> <br> IMO we need both #1 and a sort of #3 above (which makes #2 moot).<br> <br> We need a way to *recover* operations in the case of a pserv and or region<br> failure, and to do this the database needs to store the *desired* state of the<br> power in addition to its current state. As I have previously said, the pserv<br> needs to issue a "recovery" call to the region when it restarts so it can<br> converge on the desired state; for power the region would send back a list of<br> outstanding power ops on nodes and the desired state for each.<br> <br> If we're careful about node state management, we can prevent issuing more than<br> one power command at a time by preventing it unless the:<br> * power op has completed<br> * the node state is in a state that allows the op<br> <br> Let's discuss this when you start today.</blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br></blockquote><br><div>So, we didn't get to talk about this in detail this morning, but after Gavin's email here's how I see the lay of the land:</div><div><br></div><div> - We definitely want the cluster recovery feature</div><div> - We also want what I'm going to call — for want of a better word — atomic power actions</div><div> - (Which is in fact a subset of the work to make recovery just work)</div><div> - Recovery is more work than we can realistically do for 1.7</div><div> - That said, Gavin's solution for atomic power actions is not *terribly* 
expensive (see below for a summary of that) and we can do it in time for 1.7 ("we" here is me + some help from the Gavinator as needed).</div><div><br></div>I spent a day poking around and thinking about how to do recovery, and I *definitely* want to see it done (and I'd love to take the lead on it too, whilst we're on the subject), but I don't think we need to do all of it right now to get to a place where power actions are far more reliable.<div><br></div><div><br></div><div>Gavin's solution</div><div>================</div><div><br></div><div> - Power actions are implemented as classes</div><div> - Each power action implements a cancel() method</div><div> - We keep a registry in PServ of the power actions in play or waiting for a given node</div><div> - When a new power action is issued for a node it checks with the other power actions in the registry for that node:</div><div> - If it is the same as an action currently in the registry, it discards itself; this is a way of ensuring that we don't end up with a long queue of flip-floppy actions.</div><div> - If it supersedes the action currently in play* it calls the cancel() method on that action and waits for it to be cancelled before running (since power actions' cancel() methods will probably be decorated with @inlineCallbacks this can simply be a case of the new action adding itself as a callback).</div><div> - If it needs to wait for the action in play to finish** it does so, again probably by adding itself as a callback somewhere.</div><div><br></div><div>If we implement this we get:</div><div><br></div><div> - One power action at a time per node</div><div> - Cancellation of superseded actions</div><div> - Coalescing of identical actions (no more flip-flopping)</div><div><br></div><div>I've had a good look at the code this evening and I'm happy to start hacking on this in the morning unless there are strong objections. Word of God appreciated, please.</div><div><br></div><div>*It may be that we want all 
actions to supersede what's in play, but for my purposes here I'm saying that power off > power on</div><div>**I had a really good idea as to why this was necessary when I started writing this email, but I can't remember it now</div><div><br></div>
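<div>To make the registry rules above a bit more concrete, here is a rough Python sketch of the decision logic. Everything in it is illustrative rather than existing MAAS code: the class names, the strings returned by submit(), and the PRECEDENCE table (which only encodes the "power off > power on" assumption from the first footnote) are made up for this email. It is also plain synchronous Python; the real thing would presumably chain Deferreds instead of returning strings.</div>

```python
from collections import defaultdict

# Assumption from the first footnote: power off supersedes power on.
PRECEDENCE = {"on": 0, "off": 1}


class PowerAction:
    """One power action for one node (hypothetical class, for illustration)."""

    def __init__(self, node_id, kind):
        self.node_id = node_id
        self.kind = kind  # "on" or "off"
        self.cancelled = False

    def cancel(self):
        # A real action would abort its in-flight power driver call here.
        self.cancelled = True


class PowerActionRegistry:
    """Tracks the action in play, and any actions waiting, for each node."""

    def __init__(self):
        self._in_play = {}
        self._waiting = defaultdict(list)

    def submit(self, action):
        """Decide what a newly issued action should do; returns a string."""
        node = action.node_id
        current = self._in_play.get(node)
        if current is None:
            # Nothing in play for this node, so run straight away.
            self._in_play[node] = action
            return "run"
        queued_kinds = [a.kind for a in self._waiting[node]]
        if action.kind == current.kind or action.kind in queued_kinds:
            # Same as something already registered: coalesce by discarding.
            return "discard"
        if PRECEDENCE[action.kind] > PRECEDENCE[current.kind]:
            # Supersede: cancel what is in play and take its place.
            current.cancel()
            self._in_play[node] = action
            return "supersede"
        # Otherwise wait for the action in play to finish.
        self._waiting[node].append(action)
        return "wait"

    def finished(self, action):
        """Mark an action as done; returns the next action to run, if any."""
        node = action.node_id
        if self._in_play.get(node) is action:
            del self._in_play[node]
            if self._waiting[node]:
                nxt = self._waiting[node].pop(0)
                self._in_play[node] = nxt
                return nxt
        return None
```

<div>With this shape there is only ever one entry in play per node, a duplicate request returns "discard", and a power off issued while a power on is in play cancels it and takes over. Whether every action should supersede the one in play is still the open question from the footnote; changing that would only mean changing the PRECEDENCE check.</div>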