constraints call notes/proposals/sync

Thu Feb 7 01:56:53 UTC 2013

Hey all.

Earlier today, we had a call about constraints. It was not especially
agenda-driven; many voices were heard, and viewpoints expressed, and
issues left to resolve. According to my understanding, however,
general agreement was reached on the following points:

  * Perfect backward compatibility with python is not a major concern.
  * We need to clearly communicate provisioning errors in status and
    provide some mechanism for resolving them (should be fine, we do
    roughly the same thing with units).
  * Cross-cloud compatibility is a major concern, now and forever; the
    language exposed by juju should be suitable for use in scripts that
    will run on who-knows-what-cloud.
  * Different providers are very different, but the language we define
    is valuable in proportion to its utility across clouds, so it is
    dangerous to encourage provider-specific constraints.
  * It is not unheard of for external pressure to be a factor -- it may
    be that people really need to (say) distribute units' instances
    across hosts in openstack, and it is hard to resist the temptation
    to just add a provider-specific constraint in that case, even when
    one knows intellectually that it's probably a bad idea long-term.
  * The "cpu" constraint as it stands is pretty much crack, and might
    not even be worth keeping on for ec2. Certainly not useful on any
    other cloud.
  * A "cores" constraint does make sense on just about every cloud we
    know of (and can ofc be ignored, as usual, if it doesn't fit the
    current environment).
  * We should favour simplicity in all things, and try to get something
    working as soon as possible.

The bulk of contention centered around a discussion of the original
instance-type constraint; the fundamental problem is in the tension
between the desire for a cross-cloud constraints language and the fact
that most clouds already define their own instance-type language (and
that there is believed to be a significant contingent of users who (if
you like) "think" in terms of instance-type more than they do in terms
of cores and memory). This is not unnatural or surprising, really; but
it poses a significant challenge in that scripts written "naturally"
with instance-type as implemented in python have a depressingly low
chance of working predictably -- or at all -- across clouds.

I am concerned that the proposed "ec2-type" language makes for a poor
model of the range of instance types available across clouds: look at
HP cloud, then EC2, then your friendly local openstack provider, and
then go look at SmartCloud and all the others... yeah. All these
providers have made decisions that surely make local sense, but the
actual results are very different in each case, and none of their
models maps cleanly to any other's, AFAICS. And I don't think we can
expect that the next half dozen public clouds will do any better on
this front ;).

So, the overwhelming advantage of this approach is that it gives us a
common language for instance types (with an option on defining more,
as (say) HP cloud grows in popularity); the disadvantage is that the
language is somewhat lacking in expressivity in the majority of clouds
and, worse, will be actively misleading in many cases. We can't be
sure that a given cloud has any nodes that map cleanly to any instance
type: imagine one with nothing but a vast number of tiny single-core
systems with 512M of RAM each, and a few 64-core beasts with 64G. We
are (in this somewhat contrived but, I think, illustrative case)
faced with a gloomy choice between (1) putting everything on the
beasts because they're the only systems that meet the hard values
specified and (2) implementing a funky matching layer that fuzzes the
ec2 constraints up or down in the service of better matching... but I
think that the first is obviously crazy and I fear that the second
solution seriously erodes the value of defining a common language in
the first place (and certainly argues against privileging the ec2
language to the suggested degree).
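
To make option (1) concrete, here is a minimal sketch in Python of
hard-value matching against the contrived cloud described above. The
Node type, the hard_match helper and the 1-core/1700M request are all
hypothetical and purely illustrative; this is not how any provider is
actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cores: int
    mem_mb: int


# The contrived cloud: tiny single-core 512M systems and 64-core 64G beasts.
CLOUD = [Node("tiny", 1, 512), Node("beast", 64, 64 * 1024)]


def hard_match(nodes, cores, mem_mb):
    """Return the smallest node meeting both hard minimums, or None."""
    ok = [n for n in nodes if n.cores >= cores and n.mem_mb >= mem_mb]
    return min(ok, key=lambda n: (n.cores, n.mem_mb), default=None)


# Illustrative request standing in for a small ec2-style type.
choice = hard_match(CLOUD, cores=1, mem_mb=1700)
print(choice.name)  # beast
```

A request that is modest by ec2 standards still exceeds what the tiny
systems offer, so every such unit lands on a 64-core machine.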

I think it may be possible to find an alternative that allows users to
specify the instance types that make sense for their cloud explicitly,
but that does not behave in a damaging way when used in a surprising
context. Consider the following tweaks to the python system:

  * instance-type values are not checked at parse time.
  * instance-type values no longer conflict with cores/mem, and are
    set and inherited independently.
  * instance-type values are considered before cores/mem, such that
    a specified instance-type that is available in the current cloud is
    always chosen regardless of specified cores/mem.
  * any unrecognised instance-type constraint is just ignored -- it
    behaves in every respect as though it were unset -- but a warning
    is attached to the machine describing what constraint was ignored
    (this would use essentially the same pathway as agreed above re
    provisioning errors, I think).
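
The four tweaks above can be sketched roughly as follows. This is a
minimal Python illustration under stated assumptions, not a proposed
implementation: Constraints, Decision, CATALOGUE and choose are
hypothetical names, and "pick the smallest type that fits" is just a
stand-in for whatever default a real provider would choose.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Constraints:
    # instance_type is an unchecked string: nothing validates it at parse time.
    instance_type: Optional[str] = None
    cores: Optional[int] = None
    mem_mb: Optional[int] = None


@dataclass
class Decision:
    instance_type: str
    warnings: List[str] = field(default_factory=list)


# Hypothetical catalogue for one environment: name -> (cores, mem in MB).
CATALOGUE: Dict[str, Tuple[int, int]] = {"small": (1, 2048), "large": (4, 8192)}


def choose(cons: Constraints, catalogue: Dict[str, Tuple[int, int]]) -> Decision:
    warnings: List[str] = []
    if cons.instance_type is not None:
        if cons.instance_type in catalogue:
            # A recognised instance-type wins outright over cores/mem.
            return Decision(cons.instance_type, warnings)
        # Unrecognised: behave as though unset, but record a warning.
        warnings.append("ignored unrecognised instance-type %r" % cons.instance_type)
    need_cores = cons.cores or 0
    need_mem = cons.mem_mb or 0
    fits = [(spec, name) for name, spec in catalogue.items()
            if spec[0] >= need_cores and spec[1] >= need_mem]
    if not fits:
        raise LookupError("no instance type satisfies the cores/mem constraints")
    # Assumed default policy: the smallest type that fits.
    return Decision(min(fits)[1], warnings)
```

With this sketch, choose(Constraints(instance_type="m1.xlarge",
cores=2), CATALOGUE) falls back to "large" on cores alone and carries
one warning, which is what would surface against the machine in status.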

I think this makes for quite a nice story from a number of viewpoints:

  * as someone who wants to experiment with juju on the cloud I already
    use, I can deploy services directly onto the instance types I know
    and love, out of the box.
  * as someone who's trying out a script written by the first guy on a
    different cloud, or following some dude's tutorial on the internet,
    I don't get any valid instance-type information and so deploy as
    though without constraints; but by looking at my machines in status,
    I can immediately see what happened, and can trivially set my own
    requirements either as cores/mem or instance-type.
  * as the first user again, wanting my scripts to be more portable, it
    is a simple matter to add rough cores/mem constraints to the
    script, alongside what I know to be the "real" instance-type values.
  * as a battle-scarred veteran of a hundred clouds, I just ignore that
    instance-type nonsense and use cores/mem from the start, and my
    scripts will work everywhere.
  * as a user of a weird cloud consuming the battle-scarred veteran's
    script, I can tune it to my environment by specifying instance-type
    without disturbing the "real" cores/mem values (and thereby
    reducing the script's clarity and value to others).

In every case, the mental model is clear, and it's easy and natural to
make juju do what you want. Like everything, it has drawbacks, but I
don't believe any of them is a showstopper:

  * instance-type-only constraints that don't apply will be ignored and
    fall back to environment defaults; but providing a sensible default
    in the absence of constraints is, I think, a necessary aspect of a
    provider implementation: while it's not necessarily an easy problem
    in all cases, a given Environ ought usually to be able to figure
    out something that isn't completely crazy. It has to, because
    no-constraints is the default.
  * when the same script is run against two environs that share names of
    some or all instance-types used, the results on the two clouds will
    differ to the degree that the homonymous instance-types differ.
    To restate: if an instance-type constraint is applying to your
    cloud, it is overwhelmingly likely that this is because it was
    written specifically for your cloud; hence, the language most
    appropriate for further tuning is the language of your cloud.
  * it will probably in this context be sensible to allow instance-type
    to take a list of strings, in that a script can thereby be tuned
    incrementally and strictly additively, to take specific advantage
    of more clouds as they become available; such a feature would
    probably increase the impact of the preceding point to some degree,
    but my waving hands assert that it's probably not a big deal.
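
For the list-of-strings idea in the last point, one possible reading is
"first recognised name wins". That ordering rule is my assumption, not
something settled above, and first_recognised is a hypothetical helper;
a minimal Python sketch:

```python
def first_recognised(wanted, available):
    """Return (chosen, skipped).

    chosen is the first name in `wanted` that the current cloud offers,
    or None; skipped lists the names passed over before a match was found.
    """
    skipped = []
    for name in wanted:
        if name in available:
            return name, skipped
        skipped.append(name)
    return None, skipped


# Illustrative names only; neither list describes a real cloud's catalogue.
script_types = ["m1.large", "standard.medium"]
this_cloud = {"standard.small", "standard.medium"}
print(first_recognised(script_types, this_cloud))
```

Adding a name for a new cloud to the end of the list leaves behaviour
on every previously supported cloud unchanged, which is the "strictly
additive" property; a None result means falling back to cores/mem.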

And, as a final bonus, this is easy to implement: I think the approach
allows us to entirely eliminate the concept of constraint conflicts
(which is a good thing from a user perspective: a whole source of
potential confusion can be eliminated, and replaced with a simple note
that instance-type, if valid, overrides cores/mem, and that's the only
special case).
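
To illustrate what "set and inherited independently" could mean once
conflicts are gone, here is a small Python sketch. The two layers
(environment and service) and the inherit helper are hypothetical
names for illustration; the point is only that each key is merged on
its own, so setting instance-type never masks cores/mem.

```python
def inherit(environment, service):
    """Merge constraints key by key; a service value, if set, wins."""
    merged = dict(environment)
    for key, value in service.items():
        if value is not None:
            merged[key] = value
    return merged


env = {"instance-type": None, "cores": 2, "mem": 4096}
svc = {"instance-type": "m1.large", "cores": None, "mem": None}
print(inherit(env, svc))
# {'instance-type': 'm1.large', 'cores': 2, 'mem': 4096}
```

The inherited cores/mem values stay in place, ready to apply on any
cloud where the instance-type turns out to be unrecognised.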

I think that exactly the same model also applies perfectly to "arch".
Unrecognised arch constraints can be detected and ignored when
assigning units and provisioning machines, and a note can be added to
the machines stating that this has occurred; apart from anything else,
if we try to define the One True Arch Language we're stuck perpetuating
the appalling state of my python implementation, in which adding a new
"arch" value requires changing the source code; and in which an environ
has no way to intercept a known-invalid arch at the command-line, and
can only fail at provisioning time in a not-very-discoverable way ;).

And, I think, all the same arguments apply to "tags" (the purported
generic version of "maas-tags" -- unrecognised ones would be ignored
and all others would apply with full force) and even "instance-name"
("maas-name" -- ignore the constraint if the node's taken? seems
better to me than the current quiet failure...).
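
The "tags" behaviour described above might look something like this
in Python. It is only a sketch: the node inventory, the set of known
tags and both helper names are hypothetical, and nothing here reflects
the actual maas-tags implementation.

```python
def split_tags(wanted, known):
    """Separate requested tags into those the environment knows and the rest."""
    applied = [t for t in wanted if t in known]
    ignored = [t for t in wanted if t not in known]
    return applied, ignored


def matching_nodes(nodes, wanted, known):
    """Return (names of nodes carrying every applied tag, ignored tags)."""
    applied, ignored = split_tags(wanted, known)
    hits = [name for name, tags in nodes.items() if set(applied) <= tags]
    return hits, ignored


# Illustrative inventory: node name -> tags on that node.
NODES = {"n1": {"ssd", "gpu"}, "n2": {"ssd"}}
KNOWN = {"ssd", "gpu"}

print(matching_nodes(NODES, ["ssd", "rack-7"], KNOWN))
```

Recognised tags still filter with full force; the unrecognised
"rack-7" is dropped from matching and returned so that it can be
reported against the machine.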

Regardless: my understanding is that we can't skip implementing these
constraints in some way, and my desire is that we go as far as we can
before reintroducing provider-specific constraints to the language. I
don't think we've hit a need for those yet.

...so. If the above holds water, we have a common model that applies
sanely across the vast majority of use cases we are aware of for the
following pre-existing constraints: arch, mem, instance-type, maas-name
and maas-tags. I think that it's reasonable to drop cpu, and to
introduce cores in its place; the only remaining concerns are ec2-zone
and os-scheduler-hints. They, however, are a matter for another email.

On a final note, that will feed into the followup: I have a suspicion
that not everyone is on the same page wrt expectations for what
constraints will provide in 13.04. If I am completely off-base in my
understanding that parity is the target (and that I will have some
explaining to do if I drop features without providing equivalent or
better functionality), I would be grateful to every single person who
independently confirmed this to me.

Cheers
William