High Availability command line interface - future plans.

Gustavo Niemeyer gustavo at niemeyer.net
Sat Nov 9 00:16:21 UTC 2013


It doesn't feel like the difference between

    juju ensure-ha --prefer-machines 11,37

and

    juju add-state-server --to 11,37

is worth the amount of reasoning it's getting.  I'm clearly in favor of the
latter, but I wouldn't argue too hard for it.


On Fri, Nov 8, 2013 at 2:00 PM, William Reade
<william.reade at canonical.com> wrote:
> I'm concerned that we're (1) rehashing decisions made during the sprint and
> (2) deviating from requirements in doing so.
>
> In particular, abstracting HA away into "management" manipulations -- as
> roger notes, pretty much isomorphic to the "jobs" proposal -- doesn't give
> users HA so much as it gives them a limited toolkit with which they can
> more-or-less construct their own HA; in particular, allowing people to use
> an even number of state servers is strictly a bad thing [0], and I'm
> extremely suspicious of any proposal that opens that door.
>
> Of course, some will argue that mongo should be able to scale separately
> from the api servers and other management tasks, and this is a worthy goal;
> but in this context it sucks us down into the morass of exposing different
> types of management on different machines, and ends up approaching the jobs
> proposal still closer, in that it requires users to assimilate a whole load
> of extra terminology in order to perform a conceptually simple function.
>
> Conversely, "ensure-ha" (with possible optional --redundancy=N flag,
> defaulting to 1) is a simple model that can be simply explained: the
> command's sole purpose is to ensure that juju management cannot fail as a
> result of the simultaneous failure of <=N machines. It's a *user-level*
> construct that will always be applicable even in the context of a more
> sophisticated future language (no matter what's going on with this
> complicated management/jobs business, you can run that and be assured you'll
> end up with at least enough manager machines to fulfil the requirement you
> clearly stated in the command line).
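>
> (For concreteness, the arithmetic that implies, as a rough Go sketch; the
> function name is illustrative rather than anything in juju, and it assumes
> the usual majority-quorum rule for the replica set:
>
>     // managersFor returns how many manager machines are needed to
>     // survive the simultaneous failure of any `redundancy` of them:
>     // a majority-quorum group of 2N+1 members tolerates N failures.
>     func managersFor(redundancy int) int {
>         return 2*redundancy + 1
>     }
>
> So --redundancy=1, the default, means 3 managers; --redundancy=2 means 5.)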
>
> I haven't seen anything that makes me think that redesigning from scratch is
> in any way superior to refining what we already agreed upon; and it's
> distracting us from the questions of reporting and correcting manager
> failure when it occurs. I assert the following series of arguments:
>
> * users may discover at any time that they need to make an existing
> environment HA, so ensure-ha is *always* a reasonable user action
> * users who *don't* need an HA environment can, by definition, afford to
> take the environment down and reconstruct it without HA if it becomes
> unimportant
> * therefore, scaling management *down* is not the highest priority for us
> (but is nonetheless easily amenable to future control via the "ensure-ha"
> command -- just explicitly set a lower redundancy number)
> * similarly, allowing users to *directly* destroy management machines
> enables exciting new failure modes that don't really need to exist
>
> * the notion of HA is somewhat limited in worth when there's no way to make
> a vulnerable environment robust again
> * the more complexity we shovel onto the user's plate, the less likely she
> is to resolve the situation correctly under stress
> * the most obvious, and foolproof, command for repairing HA would be
> "ensure-ha" itself, which could very reasonably take it upon itself to
> replace manager nodes detected as "down" -- assuming a robust presence
> implementation, which we need anyway, this (1) works trivially for machines
> that die unexpectedly and (2) allows a backdoor for resolution of "weird"
> situations: the user can manually shut down a misbehaving manager
> out-of-band, and run ensure-ha to cause a new one to be spun up in its
> place; once HA is restored, the old machine will no longer be a manager, no
> longer be indestructible, and can be cleaned up at leisure
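>
> A very rough Go sketch of that repair behaviour, with entirely hypothetical
> state/presence helpers (nothing below is juju's actual API), just to pin
> down the shape of it:
>
>     // Hypothetical minimal interfaces standing in for juju's state
>     // and presence layers.
>     type Machine interface{ AgentAlive() bool }
>     type State interface {
>         ManagerMachines() []Machine
>         AddManagerMachine() error
>     }
>
>     // ensureHA tops the manager set back up to the requested size,
>     // counting only managers the presence layer reports as alive;
>     // dead ones are left to be demoted and cleaned up later.
>     func ensureHA(st State, redundancy int) error {
>         want := 2*redundancy + 1
>         alive := 0
>         for _, m := range st.ManagerMachines() {
>             if m.AgentAlive() {
>                 alive++
>             }
>         }
>         for ; alive < want; alive++ {
>             if err := st.AddManagerMachine(); err != nil {
>                 return err
>             }
>         }
>         return nil
>     }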
>
> * the notion is even more limited when you can't even tell when something
> goes wrong
> * therefore, HA state should *at least* be clearly and loudly communicated
> in status
> * but that's not very proactive, and I'd like to see a plan for how we're
> going to respond to these situations when we detect them
>
> * the data accessible to a manager node is sensitive, and we shouldn't
> generally be putting manager nodes on dirty machines; but density is an
> important consideration, and I don't think it's confusing to allow
> "preferred" machines to be specified in "ensure-ha", such that *if*
> management capacity needs to be added it will be put onto those machines
> before finding clean ones or provisioning new ones
> * strawman syntax: "juju ensure-ha --prefer-machines 11,37" to place any
> additional manager tasks that may be required on the supplied machines in
> order of preference (a rough sketch of that placement order follows this
> list) -- but even this falls far behind the essential goal, which is
> "make HA *easy* for our users".
> * (ofc, we should continue not to put units onto manager machines by
> default, but allow them when forced with --to as before)
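>
> A small Go sketch of the placement order referred to above (the helper
> name and signature are hypothetical, not juju's API): --prefer-machines
> candidates first, then clean machines already in the environment, then a
> freshly provisioned one:
>
>     // pickManagerMachine chooses where to put an additional manager
>     // task: a preferred machine if any remain, else a clean machine,
>     // else signal that a new machine should be provisioned.
>     func pickManagerMachine(preferred, clean []string) (id string, provisionNew bool) {
>         if len(preferred) > 0 {
>             return preferred[0], false
>         }
>         if len(clean) > 0 {
>             return clean[0], false
>         }
>         return "", true
>     }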
>
> I don't believe that any of this precludes more sophisticated management of
> juju's internal functions *when* the need becomes pressing -- whether via
> jobs, or namespaced pseudo-services, or whatever -- but at this stage I
> think it is far better to expose the policies we're capable of supporting,
> and thus leave ourselves wiggle room for the mechanism to evolve, than
> to define a user-facing model that is, at best, a woolly reflection of an
> internal model that's likely to change as we explore the solution space in
> practice.
>
> Long-term, FWIW, I would be happiest to expose fine control over HA,
> scaling, etc by presenting juju's internal functionality as a namespaced
> group of services that *can* be configured and manipulated (as much as
> possible) like normal services, because... y'know... services/units is
> actually a pretty good user model; but I think we're all in agreement that
> we shouldn't go down that rabbit hole today.
>
> Cheers
> William
>
>
> [0] consider the case of 4 managers; as with 3, if any single machine goes
> down the system will continue to function, but will fail once the second
> dies; but the situation is strictly worse because the number of machines
> that *could* fail, and thus trigger a vulnerable situation, is larger.
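>
> (As a one-liner, with the usual majority rule -- a replica set of n members
> keeps accepting writes only while a majority, n/2 + 1, survives:
>
>     // tolerates returns how many simultaneous failures a replica set
>     // of n members can absorb and still retain a write majority.
>     func tolerates(n int) int {
>         return n - (n/2 + 1)
>     }
>
> tolerates(3) == 1 and tolerates(4) == 1, so the fourth manager buys no
> extra failure tolerance; it only adds one more machine that can fail.)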
>
>
> On Fri, Nov 8, 2013 at 11:31 AM, John Arbash Meinel <john at arbash-meinel.com>
> wrote:
>>
>> On 2013-11-08 14:15, roger peppe wrote:
>> > On 8 November 2013 08:47, Mark Canonical Ramm-Christensen
>> > <mark.ramm-christensen at canonical.com> wrote:
>> >> I have a few high level thoughts on all of this, but the key
>> >> thing I want to say is that we need to get a meeting set up next
>> >> week for the solution to get hammered out.
>> >>
>> >> First, conceptually, I don't believe the user model needs to
>> >> match the implementation model.  That way lies madness -- users
>> >> care about the things they care about and should not have to
>> >> understand how the system works to get something basic done.
>> >> See:
>> >> http://www.amazon.com/The-Inmates-Are-Running-Asylum/dp/0672326140
>> >> for reasons why I call this madness.
>> >>
>> >> For that reason I think the path of adding a --jobs flag to
>> >> add-machine is not a move forward.  It is exposing implementation
>> >> detail to users and forcing them into a more complex conceptual
>> >> model.
>> >>
>> >> Second, we don't have to boil the ocean all at once. An
>> >> "ensure-ha" command that sets up additional server nodes is
>> >> better than what we have now -- nothing.  Nate is right, the box
>> >> need not be black, we could have a juju ha-status command that
>> >> just shows the state of HA.   This is fundamentally different
>> >> than changing the behavior and meaning of add-machines to know
>> >> about juju jobs and agents and forcing folks to think about
>> >> that.
>> >>
>> >> Third, I think it is possible to chart a course from ensure-ha
>> >> as a shortcut (implemented first) to the type of syntax and
>> >> feature set that Kapil is talking about.  And let's not kid
>> >> ourselves, there are a bunch of new features in that proposal:
>> >>
>> >> * Namespaces for services
>> >> * support for subordinates to state services
>> >> * logging changes
>> >> * lifecycle events on juju "jobs"
>> >> * special casing the removal of services that would kill the
>> >>   environment
>> >> * special casing the status to know about HA and warn for even
>> >>   numbers of state server nodes
>> >>
>> >> I think we will be adding a new concept and some new syntax when
>> >> we add HA to juju -- so the idea is just to make it easier for
>> >> users to understand, and to allow a path forward to something
>> >> like what Kapil suggests in the future.   And I'm pretty solidly
>> >> convinced that there is an incremental path forward.
>> >>
>> >> Fourth, the spelling "ensure-ha" is probably not a very good
>> >> idea; the cracks in that system (like taking a -n flag, and
>> >> dealing with failed machines) are already apparent.
>> >>
>> >> I think something like Nick's proposal for "add-manager" would be
>> >> better. Though I don't think that's quite right either.
>> >>
>> >> So, I propose we add one new idea for users -- a state-server.
>> >>
>> >> then you'd have:
>> >>
>> >> juju management --info
>> >> juju management --add
>> >> juju management --add --to 3
>> >> juju management --remove-from
>> >
>> > This seems like a reasonable approach in principle (it's
>> > essentially isomorphic to the --jobs approach AFAICS which makes me
>> > happy).
>> >
>> > I have to say that I'm not keen on using flags to switch the basic
>> > behaviour of a command. The interaction between the flags can then
>> > become non-obvious (for example a --constraints flag might be
>> > appropriate with --add but not --remove-from).
>> >
>> > Ah, but your next message seems to go along with that.
>> >
>> > So, to couch your proposal in terms that are consistent with the
>> > rest of the juju commands, here's how I see it could look, in terms
>> > of possible help output from the commands:
>> >
>> > usage: juju add-management [options]
>> > purpose:
>> > Add Juju management functionality to a machine, or start a new
>> > machine with management functionality. Any Juju machine can
>> > potentially participate as a Juju manager - this command adds a new
>> > such manager. Note that there should always be an odd number of
>> > active management machines, otherwise the Juju environment is
>> > potentially vulnerable to network partitioning. If a management
>> > machine fails, a new one should be started to replace it.
>>
>> I would probably avoid putting such an emphasis on "any machine can be
>> a manager machine". But that is my personal opinion. (If you want HA
>> you probably want it on dedicated nodes.)
>>
>> >
>> > options:
>> > --constraints (= )
>> >     additional machine constraints. Ignored if --to is specified.
>> > -e, --environment (= "local")
>> >     juju environment to operate in
>> > --series (= "")
>> >     the Ubuntu series of the new machine. Ignored if --to is
>> >     specified.
>> > --to (= "")
>> >     the id of the machine to add management to. If this is not
>> >     specified, a new machine is provisioned.
>> >
>> > usage: juju remove-management [options] <machine-id>
>> > purpose:
>> > Remove Juju management functionality from the machine with the
>> > given id. The machine itself is not destroyed. Note that if there
>> > are fewer than three management machines remaining, the operation
>> > of the Juju environment will be vulnerable to the failure of a
>> > single machine. It is not possible to remove the last management
>> > machine.
>> >
>>
>> I would probably also remove the machine if the only thing on it was
>> the management. Certainly that is how people want us to do "juju
>> remove-unit".
>>
>>
>> > options:
>> > -e, --environment (= "local")
>> >     juju environment to operate in
>> >
>> > As a start, we could implement only the add-management command, and
>> > not implement the --to flag. That would be sufficient for our HA
>> > deliverable, I believe. The other features could be added in time
>> > or according to customer demand.
>>
>> The main problem with this is that it feels slightly too easy to add
>> just 1 machine and then not actually have HA (mongo stops allowing
>> writes if you have a 2-node cluster and lose one, right?)
>>
>> John
>> =:->
>>
>> >
>> >> I know this is not following the add-machine format, but I think
>> >> it would be better to migrate that to something more like this:
>> >>
>> >> juju machine --add
>> >
>> > If we are going to do that, I think we should probably change all
>> > the commands at once - consistency is good.
>> >
>> > If we do the above, could we drop "juju ensure-ha" entirely, given
>> > the fact that the above commands are both easier to implement (I
>> > think!) and more powerful?
>> >
>>



-- 

gustavo @ http://niemeyer.net


