High Availability command line interface - future plans.

Nate Finch nate.finch at canonical.com
Fri Nov 8 16:12:48 UTC 2013


Scaling jobs independently doesn't really get you much.  If you need 7
machines of redundancy for mongo... why would you not just also want the
API on all 7 machines?  It's 100% upside... now your API is that much more
redundant/scaled, and we already know the API and mongo run just fine
together on a single machine.

The only point at which it makes sense to break out of "just make N copies
of the whole state server" is:


   1. if you need to go beyond mongo's 12-node maximum, or
   2. if you want to somehow have HA without using up N extra machines by
      putting bits and pieces on machines also hosting units.


Neither of those seems like a critical thing we need to support in v1 of
HA. And we should probably only try to do what is critical for v1.


On Fri, Nov 8, 2013 at 11:00 AM, William Reade
<william.reade at canonical.com> wrote:

> I'm concerned that we're (1) rehashing decisions made during the sprint
> and (2) deviating from requirements in doing so.
>
> In particular, abstracting HA away into "management" manipulations -- as
> roger notes, pretty much isomorphic to the "jobs" proposal -- doesn't give
> users HA so much as it gives them a limited toolkit with which they can
> more-or-less construct their own HA; in particular, allowing people to use
> an even number of state servers is strictly a bad thing [0], and I'm
> extremely suspicious of any proposal that opens that door.
>
> Of course, some will argue that mongo should be able to scale separately
> from the api servers and other management tasks, and this is a worthy goal;
> but in this context it sucks us down into the morass of exposing different
> types of management on different machines, and ends up approaching the jobs
> proposal still closer, in that it requires users to assimilate a whole load
> of extra terminology in order to perform a conceptually simple function.
>
> Conversely, "ensure-ha" (with possible optional --redundancy=N flag,
> defaulting to 1) is a simple model that can be simply explained: the
> command's sole purpose is to ensure that juju management cannot fail as
> a result of the simultaneous failure of <=N machines. It's a *user-level*
> construct that will always be applicable even in the context of a more
> sophisticated future language (no matter what's going on with this
> complicated management/jobs business, you can run that and be assured
> you'll end up with at least enough manager machines to fulfil the
> requirement you clearly stated in the command line).
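>
> For concreteness, a rough sketch of the arithmetic that contract
> implies, assuming mongo-style majority quorum (illustrative Go, not
> actual juju code):
>
>     // Surviving the simultaneous failure of up to `redundancy`
>     // machines with a majority-quorum replica set means a majority
>     // must remain standing, so 2n+1 state servers are required.
>     func stateServersFor(redundancy int) int {
>             return 2*redundancy + 1
>     }
>
> So --redundancy=1 (the default) gives 3 managers, --redundancy=2
> gives 5, and so on.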
>
> I haven't seen anything that makes me think that redesigning from scratch
> is in any way superior to refining what we already agreed upon; and it's
> distracting us from the questions of reporting and correcting manager
> failure when it occurs. I assert the following series of arguments:
>
> * users may discover at any time that they need to make an existing
> environment HA, so ensure-ha is *always* a reasonable user action
> * users who *don't* need an HA environment can, by definition, afford to
> take the environment down and reconstruct it without HA if it becomes
> unimportant
> * therefore, scaling management *down* is not the highest priority for us
> (but is nonetheless easily amenable to future control via the "ensure-ha"
> command -- just explicitly set a lower redundancy number)
> * similarly, allowing users to *directly* destroy management machines
> enables exciting new failure modes that don't really need to exist
>
> * the notion of HA is somewhat limited in worth when there's no way to
> make a vulnerable environment robust again
> * the more complexity we shovel onto the user's plate, the less likely she
> is to resolve the situation correctly under stress
> * the most obvious, and foolproof, command for repairing HA would be
> "ensure-ha" itself, which could very reasonably take it upon itself to
> replace manager nodes detected as "down" -- assuming a robust presence
> implementation, which we need anyway, this (1) works trivially for machines
> that die unexpectedly and (2) allows a backdoor for resolution of "weird"
> situations: the user can manually shutdown a misbehaving manager
> out-of-band, and run ensure-ha to cause a new one to be spun up in its
> place; once HA is restored, the old machine will no longer be a manager, no
> longer be indestructible, and can be cleaned up at leisure
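>
> As a rough sketch of that reconciliation (the types and names here are
> illustrative, assuming a presence-backed "up" flag per manager -- not
> juju's actual implementation):
>
>     // A manager machine as ensure-ha would see it; up comes from
>     // the presence implementation.
>     type manager struct {
>             id string
>             up bool
>     }
>
>     // managersToAdd returns how many new manager machines ensure-ha
>     // should provision: enough that the live managers number
>     // 2*redundancy+1. Managers detected as down simply stop being
>     // counted, so replacements are spun up in their place.
>     func managersToAdd(current []manager, redundancy int) int {
>             want := 2*redundancy + 1
>             up := 0
>             for _, m := range current {
>                     if m.up {
>                             up++
>                     }
>             }
>             if n := want - up; n > 0 {
>                     return n
>             }
>             return 0
>     }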
>
> * the notion is even more limited when you can't even tell when something
> goes wrong
> * therefore, HA state should *at least* be clearly and loudly communicated
> in status
> * but that's not very proactive, and I'd like to see a plan for how we're
> going to respond to these situations when we detect them
>
> * the data accessible to a manager node is sensitive, and we shouldn't
> generally be putting manager nodes on dirty machines; but density is an
> important consideration, and I don't think it's confusing to allow
> "preferred" machines to be specified in "ensure-ha", such that *if*
> management capacity needs to be added it will be put onto those machines
> before finding clean ones or provisioning new ones
> * strawman syntax: "juju ensure-ha --prefer-machines 11,37" to place any
> additional manager tasks that may be required on the supplied machines in
> order of preference -- but even this falls far behind the essential goal,
> which is "make HA *easy* for our users".
> * (ofc, we should continue not to put units onto manager machines by
> default, but allow them when forced with --to as before)
>
> I don't believe that any of this precludes more sophisticated management
> of juju's internal functions *when* the need becomes pressing -- whether
> via jobs, or namespaced pseudo-services, or whatever -- but at this stage I
> think it is far better to expose the policies we're capable of supporting,
> and thus leave ourselves wiggle room for the mechanism to evolve, than
> to define a user-facing model that is, at best, a woolly reflection of an
> internal model that's likely to change as we explore the solution space in
> practice.
>
> Long-term, FWIW, I would be happiest to expose fine control over HA,
> scaling, etc by presenting juju's internal functionality as a namespaced
> group of services that *can* be configured and manipulated (as much as
> possible) like normal services, because... y'know... services/units is
> actually a pretty good user model; but I think we're all in agreement that
> we shouldn't go down that rabbit hole today.
>
> Cheers
> William
>
>
> [0] consider the case of 4 managers; as with 3, if any single machine goes
> down the system will continue to function, but will fail once the second
> dies; but the situation is strictly worse because the number of machines
> that *could* fail, and thus trigger a vulnerable situation, is larger.
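>
> In quorum terms (a small illustrative helper, not real juju code):
>
>     // failuresTolerated returns how many members of an n-member
>     // majority-quorum group can fail before quorum is lost: more
>     // than half must stay up, so (n-1)/2 may die. Note that 3 and 4
>     // both tolerate exactly 1 failure, so the fourth member buys
>     // nothing but extra exposure.
>     func failuresTolerated(n int) int {
>             return (n - 1) / 2
>     }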
>
>
> On Fri, Nov 8, 2013 at 11:31 AM, John Arbash Meinel
> <john at arbash-meinel.com> wrote:
>
>>
>> On 2013-11-08 14:15, roger peppe wrote:
>> > On 8 November 2013 08:47, Mark Canonical Ramm-Christensen
>> > <mark.ramm-christensen at canonical.com> wrote:
>> >> I have a few high level thoughts on all of this, but the key
>> >> thing I want to say is that we need to get a meeting set up next
>> >> week for the solution to get hammered out.
>> >>
>> >> First, conceptually, I don't believe the user model needs to
>> >> match the implementation model.  That way lies madness -- users
>> >> care about the things they care about and should not have to
>> >> understand how the system works to get something basic done.
>> >> See:
>> >> http://www.amazon.com/The-Inmates-Are-Running-Asylum/dp/0672326140
>> >> for reasons why I call this madness.
>> >>
>> >> For that reason I think the path of adding a --jobs flag to
>> >> add-machine is not a move forward.  It is exposing implementation
>> >> detail to users and forcing them into a more complex conceptual
>> >> model.
>> >>
>> >> Second, we don't have to boil the ocean all at once. An
>> >> "ensure-ha" command that sets up additional server nodes is
>> >> better than what we have now -- nothing.  Nate is right, the box
>> >> need not be black; we could have a juju ha-status command that
>> >> just shows the state of HA.  This is fundamentally different
>> >> from changing the behavior and meaning of add-machine to know
>> >> about juju jobs and agents and forcing folks to think about
>> >> that.
>> >>
>> >> Third, I think it is possible to chart a course from ensure-ha
>> >> as a shortcut (implemented first) to the type of syntax and
>> >> feature set that Kapil is talking about.  And let's not kid
>> >> ourselves, there are a bunch of new features in that proposal:
>> >>
>> >> * Namespaces for services
>> >> * support for subordinates to state services
>> >> * logging changes
>> >> * lifecycle events on juju "jobs"
>> >> * special casing the removal of services that would kill the
>> >>   environment
>> >> * special casing the status to know about HA and warn for even
>> >>   state server nodes
>> >>
>> >> I think we will be adding a new concept and some new syntax when
>> >> we add HA to juju -- so the idea is just to make it easier for
>> >> users to understand, and to allow a path forward to something
>> >> like what Kapil suggests in the future.   And I'm pretty solidly
>> >> convinced that there is an incremental path forward.
>> >>
>> >> Fourth, the spelling "ensure-ha" is probably not a very good
>> >> idea; the cracks in that system (like taking a -n flag, and
>> >> dealing with failed machines) are already apparent.
>> >>
>> >> I think something like Nick's proposal for "add-manager" would be
>> >> better. Though I don't think that's quite right either.
>> >>
>> >> So, I propose we add one new idea for users -- a state-server.
>> >>
>> >> then you'd have:
>> >>
>> >> juju management --info
>> >> juju management --add
>> >> juju management --add --to 3
>> >> juju management --remove-from
>> >
>> > This seems like a reasonable approach in principle (it's
>> > essentially isomorphic to the --jobs approach AFAICS which makes me
>> > happy).
>> >
>> > I have to say that I'm not keen on using flags to switch the basic
>> > behaviour of a command. The interaction between the flags can then
>> > become non-obvious (for example a --constraints flag might be
>> > appropriate with --add but not --remove-from).
>> >
>> > Ah, but your next message seems to go along with that.
>> >
>> > So, to couch your proposal in terms that are consistent with the
>> > rest of the juju commands, here's how I see it could look, in terms
>> > of possible help output from the commands:
>> >
>> > usage: juju add-management [options]
>> > purpose:
>> > Add Juju management functionality to a machine, or start a new
>> > machine with management functionality. Any Juju machine can
>> > potentially participate as a Juju manager - this command adds a
>> > new such manager. Note that there should always be an odd number
>> > of active management machines, otherwise the Juju environment is
>> > potentially vulnerable to network partitioning. If a management
>> > machine fails, a new one should be started to replace it.
>>
>> I would probably avoid putting such an emphasis on "any machine can be
>> a manager machine". But that is my personal opinion. (If you want HA
>> you probably want it on dedicated nodes.)
>>
>> >
>> > options:
>> > --constraints (= )
>> >     additional machine constraints. Ignored if --to is specified.
>> > -e, --environment (= "local")
>> >     juju environment to operate in
>> > --series (= "")
>> >     the Ubuntu series of the new machine. Ignored if --to is
>> >     specified.
>> > --to (= "")
>> >     the id of the machine to add management to. If this is not
>> >     specified, a new machine is provisioned.
>> >
>> > usage: juju remove-management [options] <machine-id>
>> > purpose:
>> > Remove Juju management functionality from the machine with the
>> > given id. The machine itself is not destroyed. Note that if there
>> > are fewer than three management machines remaining, the operation
>> > of the Juju environment will be vulnerable to the failure of a
>> > single machine. It is not possible to remove the last management
>> > machine.
>> >
>>
>> I would probably also remove the machine if the only thing on it was
>> the management. Certainly that is how people want us to do "juju
>> remove-unit".
>>
>>
>> > options:
>> > -e, --environment (= "local")
>> >     juju environment to operate in
>> >
>> > As a start, we could implement only the add-management command, and
>> > not implement the --to flag. That would be sufficient for our HA
>> > deliverable, I believe. The other features could be added in time
>> > or according to customer demand.
>>
>> The main problem with this is that it feels slightly too easy to add
>> just 1 machine and then not actually have HA (mongo stops allowing
>> writes if you have a 2-node cluster and lose one, right?)
>>
>> John
>> =:->
>>
>> >
>> >> I know this is not following the add-machine format, but I think
>> >> it would be better to migrate that to something more like this:
>> >>
>> >> juju machine --add
>> >
>> > If we are going to do that, I think we should probably change all
>> > the commands at once - consistency is good.
>> >
>> > If we do the above, could we drop "juju ensure-ha" entirely, given
>> > the fact that the above commands are both easier to implement (I
>> > think!) and more powerful?
>> >
>>