[Maas-devel] RFC: "Serialising" power actions

Tue Sep 16 23:28:14 UTC 2014

On Tuesday 16 Sep 2014 11:22:08 Gavin Panella wrote:
> A way to recover operations in the cluster when it fails or is restarted
> is desirable, but it's something we can do next cycle, and it's out of
> scope for this cycle (and wasn't in scope anyway).

It's certainly out of scope now because we don't have time to get it done.  
However, failing to build this in from the start is a huge mistake, especially 
in the context of making MAAS robust in the face of failures.

> Celery with RabbitMQ gave us /an/ ability to recover from a restart or
> a crash, assuming Celery uses AMQP settlement correctly (which I assume
> it probably does).
> 
> I don't think it gave us a way to prevent 100s of power-on and power-off
> tasks from being queued, it didn't give us a way to make immediate
> queries of clusters, and didn't give clusters a way to talk to the
> region.
> 
> So, with the move to RPC we've lost some of that ability for a cluster
> to pick up where it left off, but we have gained a lot more control over
> MAAS's infrastructure, and have a strong basis for layering more HA and
> HA-like features on top.
> 
> I agree that putting "truth" into the database and designing code to
> converge on that is the right approach.
> 
> (Aside: once a node has been deployed, MAAS can no longer have a desired
> power state to converge on. The node belongs to the user at that point,
> and he/she has the freedom to turn it off and on as needed, and he/she
> can do that by mechanisms other than MAAS.)

Massively disagree.

We need to make a stand here and insist that MAAS controls all aspects of the 
Node, including its power.  MAAS must always know what state the node needs to 
be in, unless the node is broken.
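
To make the "converge on the truth in the database" idea concrete, here is a 
rough sketch of what I have in mind.  It is not MAAS code: NodeRecord, 
converge, query_power and change_power are all hypothetical names, and a plain 
dict stands in for both the database and the BMCs.

```python
# Hypothetical sketch only; none of these names exist in MAAS.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class NodeRecord:
    """The desired power state ("on" or "off") as recorded in the database."""
    desired: str


def converge(
    records: Dict[str, NodeRecord],
    query_power: Callable[[str], str],
    change_power: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Issue at most one power action per node whose observed state
    differs from the desired state, and return the actions taken."""
    actions = []
    for system_id, record in sorted(records.items()):
        observed = query_power(system_id)
        if observed != record.desired:
            change_power(system_id, record.desired)
            actions.append((system_id, record.desired))
    return actions


# A dict standing in for the real power drivers.
actual = {"node-a": "off", "node-b": "on"}
records = {"node-a": NodeRecord("on"), "node-b": NodeRecord("on")}

first = converge(records, actual.__getitem__, actual.__setitem__)
second = converge(records, actual.__getitem__, actual.__setitem__)
```

Because the loop only compares desired against observed state, it can be 
re-run after a cluster restart without replaying a queue, and a second pass 
over an already-converged node issues no further power actions.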

J