Fwd: High Availability command line interface - future plans.

Fri Nov 8 18:06:02 UTC 2013

Oops, I only responded to William rather than the whole group.

---------- Forwarded message ----------
From: Mark Canonical Ramm-Christensen <mark.ramm-christensen at canonical.com>
Date: Sat, Nov 9, 2013 at 2:02 AM
Subject: Re: High Availability command line interface - future plans.
To: William Reade <william.reade at canonical.com>

On Sat, Nov 9, 2013 at 12:00 AM, William Reade
<william.reade at canonical.com> wrote:

> I'm concerned that we're (1) rehashing decisions made during the sprint
> and (2) deviating from requirements in doing so.
>
I'm sure that this is a valid concern.

My concern was that we were wandering back into territory often discussed in
the past (juju deploy juju) because the "ensure-ha" spelling was both
not expressive enough and too strongly expressive of an idea that we aren't
making true (that we will fix HA for you, even in the future).

> In particular, abstracting HA away into "management" manipulations -- as
> roger notes, pretty much isomorphic to the "jobs" proposal -- doesn't give
> users HA so much as it gives them a limited toolkit with which they can
> more-or-less construct their own HA; in particular, allowing people to use
> an even number of state servers is strictly a bad thing [0], and I'm
> extremely suspicious of any proposal that opens that door.
>

I'd be fine with:

juju make-me-ha

as a better (officially approved) way of saying:

juju add-state-server -n 2

Heck, I'd even be fine with implementing make-me-ha as being the "one true
way" to do HA with us not even implementing add-state-server yet.

I do, however, think the ability to place state-servers as you go is pretty
useful. Folks setting up HA can be expected to read documentation, and when
they don't, they can be reminded that they have done something bad when they
type the command, again every time they run status, and potentially every
time they open up the GUI.

> Of course, some will argue that mongo should be able to scale separately
> from the api servers and other management tasks, and this is a worthy goal;
> but in this context it sucks us down into the morass of exposing different
> types of management on different machines, and ends up approaching the jobs
> proposal still closer, in that it requires users to assimilate a whole load
> of extra terminology in order to perform a conceptually simple function.
>

Yeah, that complexity in the --jobs proposal is one thing I want to avoid --
but even more so, I want to avoid hiding that complexity in a very odd
place (behind add-machine), which is quite unintuitive.

> Conversely, "ensure-ha" (with possible optional --redundancy=N flag,
> defaulting to 1) is a simple model that can be simply explained: the
> command's sole purpose is to ensure that juju management cannot fail as a
> result of the simultaneous failure of <=N machines. It's a *user-level*
> construct that will always be applicable even in the context of a more
> sophisticated future language (no matter what's going on with this
> complicated management/jobs business, you can run that and be assured
> you'll end up with at least enough manager machines to fulfil the
> requirement you clearly stated in the command line).
>

I think the problem here is that it is felt to be "too magical" and to
express something we can't really promise: that HA is ensured. But just
changing "ensure" to "set" makes that second point go away.
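
The sizing rule implied by "--redundancy=N" can be sketched as below, assuming the state servers vote MongoDB-style, so that a primary needs a strict majority of members. Under that assumption, surviving N simultaneous failures takes 2N+1 servers. The function names are hypothetical illustrations, not juju commands or APIs.

```python
# Hypothetical sizing rule for "ensure-ha --redundancy=N" (or a "set"/"make-ha"
# spelling), assuming majority voting among state servers.

def target_state_servers(redundancy: int = 1) -> int:
    """Servers needed so that losing `redundancy` of them still leaves a majority."""
    if redundancy < 0:
        raise ValueError("redundancy must be non-negative")
    return 2 * redundancy + 1

def servers_to_add(current: int, redundancy: int = 1) -> int:
    """How many servers one run would add; clamped at zero, so it never removes any."""
    return max(0, target_state_servers(redundancy) - current)

# A fresh environment starts with a single state server.
print(servers_to_add(1))                # 2, the same as "add-state-server -n 2"
print(servers_to_add(3))                # 0, already at the target
print(servers_to_add(3, redundancy=2))  # 2 more, for 5 in total
```

Because the result is clamped at zero, a second run is a no-op, which is the idempotent behaviour asked of make-ha later in the thread. Whether a "set" spelling should also shrink an oversized set of servers is left open here.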

Kapil brought up the point that a complicated syntax for HA was bad,
but that it would be "even worse" to "assume it always just works in a
blackbox."

> I haven't seen anything that makes me think that redesigning from scratch
> is in any way superior to refining what we already agreed upon; and it's
> distracting us from the questions of reporting and correcting manager
> failure when it occurs. I assert the following series of arguments:
>
> * users may discover at any time that they need to make an existing
> environment HA, so ensure-ha is *always* a reasonable user action
>

But there are different ways of going about it given a desire for density...

> * similarly, allowing users to *directly* destroy management machines
> enables exciting new failure modes that don't really need to exist
>

Well, if you are at 4 state-server nodes, there are two ways to get into
a "good state": adding another mongo or removing one ;)  Both allow failure
modes.
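
The arithmetic behind the even-number objection, and behind both ways out of a 4-node setup, can be sketched as follows. It assumes the state servers form a MongoDB-style replica set in which a primary can only be elected by a strict majority of voting members; the function names are illustrative only.

```python
# Majority-quorum arithmetic, assuming a primary needs votes from a strict
# majority of the voting state servers.

def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1

def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority can still be formed."""
    return members - majority(members)

for n in range(1, 7):
    print(f"{n} servers: majority {majority(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

Under this model 4 servers tolerate one failure, exactly like 3, so the fourth machine adds cost without adding resilience. Adding a fifth raises the tolerance to two failures, while removing one keeps it at one with fewer machines.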

>
> * the notion of HA is somewhat limited in worth when there's no way to
> make a vulnerable environment robust again
>

I think all the proposals offer a way to bring back HA, so I think on this
at least everybody is in agreement.

> * the more complexity we shovel onto the user's plate, the less likely she
> is to resolve the situation correctly under stress
>

Agreed. And make-ha is perhaps a good spelling for that, where it makes
sure you have exactly 3 machines.

> * the most obvious, and foolproof, command for repairing HA would be
> "ensure-ha" itself, which could very reasonably take it upon itself to
> replace manager nodes detected as "down" -- assuming a robust presence
> implementation, which we need anyway, this (1) works trivially for machines
> that die unexpectedly and (2) allows a backdoor for resolution of "weird"
> situations: the user can manually shut down a misbehaving manager
> out-of-band, and run ensure-ha to cause a new one to be spun up in its
> place; once HA is restored, the old machine will no longer be a manager, no
> longer be indestructible, and can be cleaned up at leisure
>

I think this can all be done manually for now -- we don't automate it for
services, or have any of the primitives for HA that we would need.
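
For concreteness, the replacement behaviour proposed for ensure-ha reduces to a small planning step, sketched below. The `Manager` type and `plan_ensure_ha()` are hypothetical, and the sketch assumes the robust presence signal mentioned above exists (modelled as the `alive` flag); it only plans, and performs no actions.

```python
# Hypothetical planning step for an ensure-ha run that replaces managers
# detected as "down". Assumes a reliable presence signal (`alive`) and
# majority voting, so N-failure redundancy needs 2N+1 live managers.
from dataclasses import dataclass

@dataclass(frozen=True)
class Manager:
    machine_id: str
    alive: bool

def plan_ensure_ha(managers, redundancy=1):
    """Return (number of new managers to start, ids of dead managers to demote)."""
    target = 2 * redundancy + 1
    alive = [m for m in managers if m.alive]
    dead = [m.machine_id for m in managers if not m.alive]
    to_start = max(0, target - len(alive))
    # This function only reports the dead managers; a caller would demote
    # them after the replacements are running, leaving ordinary machines
    # that can be cleaned up at leisure.
    return to_start, dead

managers = [Manager("0", True), Manager("1", False), Manager("2", True)]
print(plan_ensure_ha(managers))  # (1, ['1'])
```

A manager that the user shuts down out-of-band simply reports `alive=False`, so it takes the same path as a machine that died unexpectedly.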

>
> I don't believe that any of this precludes more sophisticated management
> of juju's internal functions *when* the need becomes pressing -- whether
> via jobs, or namespaced pseudo-services, or whatever -- but at this stage I
> think it is far better to expose the policies we're capable of supporting,
> and thus allow ourselves wiggle room to allow the mechanism to evolve, than
> to define a user-facing model that is, at best, a woolly reflection of an
> internal model that's likely to change as we explore the solution space in
> practice.
>

Cool, I think that make-ha sounds both idempotent and transient (it can be
safely run multiple times, and doesn't imply that it is *watching and
updating things dynamically*), so I'm all for it.

And I think the need for add-state-server is predicated on a desire to make
the manual bits easier to handle -- and I'd rather do that than
add-machine --jobs -- but a high-level command that just does the
right thing is better yet, in my opinion.

> Long-term, FWIW, I would be happiest to expose fine control over HA,
> scaling, etc by presenting juju's internal functionality as a namespaced
> group of services that *can* be configured and manipulated (as much as
> possible) like normal services, because... y'know... services/units is
> actually a pretty good user model; but I think we're all in agreement that
> we shouldn't go down that rabbit hole today.
>

I think there is broad agreement about the niceness of exposing it this way,
even if it is implemented very differently under the hood. I do think that
it is important not to go down that road until internal functionality actually
acts like units in enough ways that users are unlikely to immediately move on
to asking why it behaves differently from all other units ;)

That said, after a bit more thinking this evening on the train to Hong Kong,
I'm not sure the list of issues Roger raised is actually likely to cause
user problems:

> The difficulty I have with 1) is that there's a significant mismatch between
> the user's view of things and what's going on underneath.
> For instance, with a built-in service, can I:
>
> - add a subordinate service to it?
> - see the relevant log file in the usual place for a unit?
> - see its charm metadata?
> - join to its juju-info relation?
>
> If it's a single service, how can its units span different series?
> (presumably it has got a charm URL, which includes the series)
>
I don't think users are generally going to have many expectations about
the five points above, so I don't think they matter. Still, something
feels icky, and I'm not completely sold on doing "add-unit
state-server --to=lxc5/1".

The more I thought about it all evening, the less convinced I was that
calling this "add-unit state-server" (even though that is not AT ALL how it
is implemented) was a terrible idea. The --to and --constraints syntax
applies in both cases, and I'm not sure that "normal" users have a crisp idea
of what a "unit" is; they probably have no expectations about the 5
mismatches listed above, at least not until they are pretty far down the
rabbit hole.

--Mark Ramm