thoughts on priorities

Tue Apr 30 08:49:08 UTC 2013

Hi all

Now that the immediate dust of release has settled, it's important that
we take stock of our situation to make sure we're heading in a sensible
direction. While the bugs that come in from actual users often need
immediate attention, and the cloud sprint next week will help us
determine further priorities for the coming months, there's one specific
feature we should be thinking about right now so we can understand the
impact it'll have on the rest of our development this cycle.

That feature is the internal API, and it's (1) definitely necessary for
security and (2) almost certainly necessary to allow us to scale out
without requiring unreasonably large instances for our state servers.
Sadly, this feature has tentacles. I have some ideas for how we can do
this sanely, but I've probably missed something; I'd appreciate critical
feedback on the following.

The most critical aspect in my view is that the internal API offers us a
chance to insert something resembling a sane security model; having
direct state access across the board has led to us using the environment
config (containing important secrets) in the Uniter and the Upgrader
tasks, both of which are run on agents that should absolutely not be
considered worthy of that trust.

The Uniter's easy to fix: we just need an API method that returns the
provider type so we can create an EnvironProvider (which doesn't need
secrets) to get our private/public addresses.
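A rough sketch of that Uniter change, assuming a single API call that
returns the provider type. Every name here (ProviderTyper,
AddressProvider, the providers registry, the dummy provider and its
addresses) is hypothetical and only illustrates the shape of the idea;
it is not existing juju code.

```go
package main

import "fmt"

// AddressProvider is a hypothetical, secret-free slice of an EnvironProvider:
// it can only report the addresses of the machine it runs on.
type AddressProvider interface {
	PublicAddress() (string, error)
	PrivateAddress() (string, error)
}

// ProviderTyper is the one (hypothetical) API call the Uniter would need.
type ProviderTyper interface {
	ProviderType() (string, error)
}

// dummyProvider returns fixed, made-up addresses; a real provider would
// look them up locally without needing any environment secrets.
type dummyProvider struct{}

func (dummyProvider) PublicAddress() (string, error)  { return "dummy.example.com", nil }
func (dummyProvider) PrivateAddress() (string, error) { return "10.0.0.1", nil }

// providers is a hypothetical registry keyed by provider type.
var providers = map[string]func() AddressProvider{
	"dummy": func() AddressProvider { return dummyProvider{} },
}

// addressProviderFor asks the API for the provider type only, then builds
// a provider from the local registry. No environment config is fetched.
func addressProviderFor(api ProviderTyper) (AddressProvider, error) {
	typ, err := api.ProviderType()
	if err != nil {
		return nil, err
	}
	newProvider, ok := providers[typ]
	if !ok {
		return nil, fmt.Errorf("unknown provider type %q", typ)
	}
	return newProvider(), nil
}

// fakeAPI stands in for the API connection.
type fakeAPI struct{ typ string }

func (f fakeAPI) ProviderType() (string, error) { return f.typ, nil }

func main() {
	p, err := addressProviderFor(fakeAPI{typ: "dummy"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	private, _ := p.PrivateAddress()
	public, _ := p.PublicAddress()
	fmt.Println(private, public)
}
```

The point of the shape is that the only thing crossing the API is a
string naming the provider; nothing secret ever reaches the unit agent.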

However, the Upgrader change has tentacles. At the moment, every single
agent requires the environment secrets in order to download fresh tools
on upgrade, and this is unacceptable. We need to be able to watch for
tools changes, and get their URLs, via the API, and this will involve
some careful thinking re sane design (I would very much like us to model
it such that we can cleanly switch to a canary-capable implementation in
the future...).

And, finally, the connection/task-running code inside jujud will also
need some work to deal with those agents that are only allowed API
connections.

On the upside, we don't actually need the internal API to cover every
part of state. Certain tasks are trusted by necessity, either because
they legitimately use the environment keys (Provisioner, Firewaller) or
because they require direct state access to do their job (the API
server, and whatever putative task ends up responsible for the state
server itself).

Happily, all these tasks are ones that we only need a few of, and can
therefore comfortably restrict to only running on machines that are
running state servers (and are therefore implicitly trusted regardless);
the ones we need to scale out in a big way are also the ones that we
want to restrict to API-only access. These are:

  * Uniter
  * Machiner
  * Deployer
  * Upgrader
  * (and the jujud code)

...and (aside from the aforementioned changes to Uniter/Upgrader) they
use only a restricted subset of the current state API, which I think
will reduce the implementation burden somewhat.
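The trusted/untrusted split above could be expressed inside jujud
roughly as follows. This is a deliberately simplified sketch: the Access
type, the taskAccess table and permittedTasks are hypothetical, and it
ignores the fact that not every task belongs on every agent.

```go
package main

import (
	"fmt"
	"sort"
)

// Access is the kind of connection a task needs (hypothetical type).
type Access int

const (
	APIOnly     Access = iota // may only talk to the API server
	DirectState               // needs a direct state connection
)

// taskAccess records the split described in the text.
var taskAccess = map[string]Access{
	"uniter":      APIOnly,
	"machiner":    APIOnly,
	"deployer":    APIOnly,
	"upgrader":    APIOnly,
	"provisioner": DirectState,
	"firewaller":  DirectState,
	"apiserver":   DirectState,
}

// permittedTasks lists, in sorted order, the tasks allowed on a machine.
// Tasks needing direct state access are only allowed where a state
// server is running.
func permittedTasks(onStateServer bool) []string {
	var names []string
	for name, access := range taskAccess {
		if access == DirectState && !onStateServer {
			continue
		}
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println("untrusted machine:", permittedTasks(false))
	fmt.Println("state server:     ", permittedTasks(true))
}
```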

I'll be going through these today to try to figure out the details a
little more; once we have those nailed down, we'll need to figure out
how many people can reasonably swarm on this work without treading on
each other's toes.

However, there's another potential tentacle. The switch to API-only
access will be, uh, challenging to pull off as a perfectly compatible
upgrade; when we disable state access for agents on untrusted machines,
we'll need to be certain that they all run code that can handle API-only
access. So we also need to consider major-version upgrades as a matter
of urgency, and of simple prudence too: I have a lurking suspicion that
sooner or later we'll hit *some* scaling problem that can only be
resolved by a schema change, and I'd like us to be prepared.
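To make the compatibility concern concrete, here is a sketch of the
check that would have to pass before state access is switched off:
every agent must report a version at or above the first one that can
run API-only. The Version type, the laggards helper, the agent names
and the 2.0.0 threshold are all hypothetical; this is not a proposal
for the upgrade model itself.

```go
package main

import (
	"fmt"
	"sort"
)

// Version is a simple major.minor.patch triple (hypothetical type).
type Version struct{ Major, Minor, Patch int }

// parseVersion reads the leading "major.minor.patch" from s.
func parseVersion(s string) (Version, error) {
	var v Version
	if _, err := fmt.Sscanf(s, "%d.%d.%d", &v.Major, &v.Minor, &v.Patch); err != nil {
		return Version{}, fmt.Errorf("invalid version %q: %v", s, err)
	}
	return v, nil
}

// less reports whether v is older than w.
func (v Version) less(w Version) bool {
	if v.Major != w.Major {
		return v.Major < w.Major
	}
	if v.Minor != w.Minor {
		return v.Minor < w.Minor
	}
	return v.Patch < w.Patch
}

// laggards returns the sorted names of agents whose reported version is
// older than minimum. State access can only be disabled once it is empty.
func laggards(agentVersions map[string]string, minimum Version) ([]string, error) {
	var old []string
	for name, s := range agentVersions {
		v, err := parseVersion(s)
		if err != nil {
			return nil, fmt.Errorf("agent %s: %v", name, err)
		}
		if v.less(minimum) {
			old = append(old, name)
		}
	}
	sort.Strings(old)
	return old, nil
}

func main() {
	agents := map[string]string{
		"machine-0":        "2.0.0",
		"machine-1":        "1.10.0",
		"unit-wordpress-0": "2.0.1",
	}
	old, err := laggards(agents, Version{Major: 2})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(old) > 0 {
		fmt.Println("cannot disable state access yet; still on old code:", old)
		return
	}
	fmt.Println("all agents can run API-only")
}
```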

If anything I've said is wrong or stupid, please let me know; otherwise,
I would be most grateful if someone were to step up and write a
reasonably detailed proposal for a major-upgrade model that we can
implement fairly quickly. Anyone?

Cheers
William