High Availability command line interface - future plans.

Nick Veitch nick.veitch at canonical.com
Wed Nov 6 19:20:32 UTC 2013


just my tuppence...

Would it not be clearer to add an additional command to implement your
proposal? E.g. "add-manager" and possibly "destroy/remove-manager"
This could also support switches for later fine control, and possibly
be less open to misinterpretation than overloading the add-machine
command?

Nick

On Wed, Nov 6, 2013 at 6:49 PM, roger peppe <rogpeppe at gmail.com> wrote:
> The current plan is to have a single "juju ensure-ha-state" juju
> command. This would create new state server machines if there are fewer
> than the required number (currently 3).
>
> Taking that as given, I'm wondering what we should do
> in the future, when users require more than a single
> big On switch for HA.
>
> How does the user:
>
> a) know about the HA machines, so that the costs of HA are not hidden
> and the implications of particular machine failures are clear?
>
> b) fix the system when a machine dies?
>
> c) scale up the system to x thousand nodes?
>
> d) scale down the system?
>
> For a), we could tag a machine in the status as a "state server", and
> hope that the user knows what that means.
>
> For b) the suggestion is that the user notices that a state server machine
> is non-responsive (as marked in status) and runs destroy-machine on it,
> which will notice that it's a state server machine and automatically
> start another one to replace it. Destroy-machine would refuse to work
> on a state server machine that seems to be alive.
>
> For c) we could add a flag to ensure-ha-state suggesting a desired number
> of state-server nodes.
>
> I'm not sure what the suggestion is for d) given that we refuse to
> destroy live state-server machines.
>
> Although ensure-ha-state might be a fine way to turn
> on HA initially I'm not entirely happy with expanding it to cover
> all the above cases. It seems to me like we're going
> to create a leaky abstraction that purports to be magic ("just wave the
> HA wand!") and ends up being limiting, and in some cases confusing
> ("Huh? I asked to destroy that machine and there's another one
> just been created")
>
> I believe that any user that's using HA will need to understand that
> some machines are running state servers, and when things fail, they
> will need to manage those machines individually (for example by calling
> destroy-machine).
>
> I also think that the solution to c) is limiting, because there is
> actually no such thing as a "state server" - we have at least three
> independently scalable juju components (the database servers (mongodb),
> the API servers and the environment managers) with different scaling
> characteristics. I believe that in any sufficiently large environment,
> the user will not want to scale all of those at the same rate. For example
> MongoDB will allow at most 12 members of a replica set, but a caching API
> server could potentially usefully scale up much higher than that. We could
> add more flags to ensure-ha-state (e.g. --state-server-count), but then
> we'd lack the capability to suggest which components might be grouped with which.
>
> PROPOSAL
>
> My suggestion is that we go for a "slightly less magic" approach
> that provides the user with the tools to manage their own
> high-availability setup, adding appropriate automation in time.
>
> I suggest that we let the user know that machines can run as juju server
> nodes, and provide them with the capability to *choose* which machines
> will run as server nodes and which can host units - that is, what *jobs*
> a machine will run.
>
> Here's a possible proposal:
>
> We already have an "add-machine" command. We'd add a "--jobs" flag
> to allow the user to specify the jobs that the new machine(s) will
> run. Initially we might have just two jobs, "manager" and "unit"
> - the machine can either host service units, or it can manage the
> juju environment (including running the state server database),
> or both. In time we could add finer levels of granularity to allow
> separate scalability of juju server components, without losing backwards
> compatibility.
>
> If the new machine is marked as a "manager", it would run a mongo
> replica set peer. This *would* mean that it would be possible to have
> an even number of mongo peers, with the potential for a split vote
> if the nodes were partitioned evenly, and resulting database stasis.
> I don't *think* that would actually be a severe problem in practice.
> We would make juju status point out the potential problem very clearly,
> just as it should point out the potential problem if one of an existing
> odd-sized replica set dies. The potential problems are the same in both
> cases, and are straightforward for even a relatively naive user to avoid.
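> The quorum arithmetic behind this can be sketched as follows (a minimal
> illustration of the standard replica-set majority rule, not juju code):

```go
package main

import "fmt"

// majority returns the minimum number of replica-set members that must be
// mutually reachable for the set to elect a primary and accept writes.
func majority(n int) int {
	return n/2 + 1
}

// tolerates returns how many member failures a replica set of size n can
// survive while still retaining a voting majority.
func tolerates(n int) int {
	return n - majority(n)
}

func main() {
	for _, n := range []int{2, 3, 4, 5} {
		fmt.Printf("members=%d majority=%d tolerates=%d failure(s)\n",
			n, majority(n), tolerates(n))
	}
}
```

> Note that an even-sized set tolerates no more failures than the odd-sized
> set one smaller (4 members tolerate 1 failure, same as 3), which is exactly
> the situation status would need to flag.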
>
> Thus, juju ensure-ha-state is almost equivalent to:
>
>     juju add-machine --jobs manager -n 2
>
> In my view, this command feels less "magic" than ensure-ha-state - the
> runtime implications (e.g. cost) of what's going on are easier for the
> user to understand and it requires no new entities in a user's model of
> the system.
>
> In addition to the new add-machine flag, we'd add a single new command,
> "juju machine-jobs", which would allow the user to change the jobs
> associated with an existing machine.  That could be a later addition -
> it's not necessary in the first cut.
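> One way to model this (purely illustrative - the type and method names
> here are assumptions, not the juju API) is a machine carrying a set of
> jobs that machine-jobs would replace wholesale:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Machine carries an id and a set of job names. The mail proposes just
// two jobs to start with, "manager" and "unit", with finer-grained jobs
// added later without breaking compatibility.
type Machine struct {
	Id   string
	Jobs map[string]bool
}

// SetJobs sketches what a "juju machine-jobs" command might do:
// replace the job set associated with an existing machine.
func (m *Machine) SetJobs(jobs ...string) {
	m.Jobs = make(map[string]bool)
	for _, j := range jobs {
		m.Jobs[j] = true
	}
}

// JobList renders the job set in a stable order, as status output might.
func (m *Machine) JobList() string {
	var js []string
	for j := range m.Jobs {
		js = append(js, j)
	}
	sort.Strings(js)
	return strings.Join(js, ",")
}

func main() {
	m := &Machine{Id: "3"}
	m.SetJobs("unit")            // an ordinary workload machine
	m.SetJobs("manager", "unit") // promoted: now also manages the environment
	fmt.Printf("machine %s jobs: %s\n", m.Id, m.JobList())
}
```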
>
> With these primitives, I *think* the responsibilities of the system and
> the model to the user become clearer.  Looking back to the original
> user questions:
>
> a) The "state manager" status of certain machines in the status is no
> longer something entirely divorced from user control - it means something
> in terms of the commands the user is provided with.
>
> b) The user already knows about destroy-machine.  They can manage broken
> state manager machines just as they would manage any other broken machine.
> Destroy-machine would refuse to destroy any state server machine that
> would take the currently connected set of mongo peers below a majority.
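> That guard reduces to a simple check (a sketch under the assumption that
> we know how many peers remain reachable after the removal; the function
> name is illustrative):

```go
package main

import "fmt"

// canDestroyManager reports whether destroying a manager machine is safe:
// the remaining *connected* mongo peers must still form a majority of the
// replica set after removal. aliveAfter is the number of reachable peers
// once the machine is gone; totalAfter is the resulting replica-set size.
func canDestroyManager(aliveAfter, totalAfter int) bool {
	return aliveAfter >= totalAfter/2+1
}

func main() {
	// 3-peer set, all alive: removing one leaves 2 of 2 reachable - allowed.
	fmt.Println(canDestroyManager(2, 2))
	// 3-peer set with one peer already down: removing a live one would
	// leave 1 of 2 reachable - refused.
	fmt.Println(canDestroyManager(1, 2))
}
```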
>
> c) We already have add-machine.
>
> d) We already have destroy-machine. See c) above.
>
> REQUEST FOR COMMENTS
>
> If there is broad agreement on the above, then I propose that
> we start off by implementing ensure-ha-state with as little internal
> logic as possible - we don't necessarily need the transactional logic
> that guards against starting extra machines when many clients call
> ensure-ha-state concurrently, for example.
>
> In fact, ensure-ha-state could probably be written as a thin layer
> on top of add-machine --jobs.
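> The thin layer amounts to little more than a subtraction (a sketch;
> ensureHAState and its parameters are illustrative names, not juju code):

```go
package main

import "fmt"

// ensureHAState returns how many new manager machines are needed to reach
// the desired count (3 in the current plan) - i.e. the -n argument that
// "juju add-machine --jobs manager -n <count>" would be invoked with.
func ensureHAState(currentManagers, desired int) int {
	if currentManagers >= desired {
		return 0 // already at (or above) the desired count: nothing to do
	}
	return desired - currentManagers
}

func main() {
	// A fresh environment has one state server; HA wants three.
	fmt.Printf("juju add-machine --jobs manager -n %d\n", ensureHAState(1, 3))
}
```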
>
> Thoughts?
>
> --
> Juju-dev mailing list
> Juju-dev at lists.ubuntu.com
> Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
