Planning for Juju 2.2 (16.10 timeframe)
Stuart Bishop
stuart.bishop at canonical.com
Sat Mar 19 01:02:19 UTC 2016
On 9 March 2016 at 10:51, Mark Shuttleworth <mark at ubuntu.com> wrote:
> Operational concerns
I still want 'juju-wait' as a supported, built-in command rather than
as a fragile plugin that I maintain and as code embedded in Amulet
that the ecosystem team maintains. A thoughtless change to Juju's
status reporting would break all of our CI systems.
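For reference, the guts of juju-wait are roughly the following (a
minimal sketch only; the status field names are assumptions about the
Juju 2.x JSON output, not a drop-in replacement for the plugin):

    import json
    import subprocess
    import time

    def all_units_idle():
        """Return True when no unit is in error and all agents are idle."""
        raw = subprocess.check_output(
            ['juju', 'status', '--format=json']).decode('utf-8')
        status = json.loads(raw)
        # Field names assumed for Juju 2.x 'status --format=json' output.
        for app in status.get('applications', {}).values():
            for unit in (app.get('units') or {}).values():
                workload = unit.get('workload-status', {}).get('current')
                agent = unit.get('juju-status', {}).get('current')
                if workload == 'error' or agent != 'idle':
                    return False
        return True

    def juju_wait(poll_interval=10):
        """Block until the model has settled, like the juju-wait plugin."""
        while not all_units_idle():
            time.sleep(poll_interval)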
> Core Model
At the moment, logging, monitoring (alerts) and metrics all involve
customizing your charm to work with a specific subordinate. And at
deploy time you of course need to deploy and configure the
subordinate, relate it and so on, and things can get quite cluttered.
Could logging, monitoring and metrics be brought into the core model somehow?
e.g. I attach a monitoring service such as nagios to the model, and
all services implicitly join the monitoring relation. Rather than
talking bespoke protocols, units use a 'monitoring-alert' tool to send
a JSON dict to the monitoring service (for push alerts). There is some
mechanism for the monitoring service to trigger checks remotely.
Requests and alerts go via a separate SSL channel rather than the
relation, as relations are too heavyweight to trigger several times a
second and may end up blocked by, for example, other hooks running on
the unit or jujud having been killed by the OOM killer.
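From a charm's point of view, raising an alert could then be as simple
as something like this (purely illustrative; 'monitoring-alert' is the
hypothetical hook tool described above, and the payload keys are made
up):

    import json
    import subprocess

    def send_alert(check, severity, message):
        """Push an alert to whatever monitoring service is attached.

        'monitoring-alert' is the hypothetical hook tool proposed
        above, and the payload format is only an illustration.
        """
        payload = json.dumps({
            'check': check,
            'severity': severity,   # e.g. 'critical', 'warning', 'ok'
            'message': message,
        })
        subprocess.check_call(['monitoring-alert', payload])

    # e.g. send_alert('replication-lag', 'warning',
    #                 'standby is 300s behind the primary')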
Similarly, we currently handle logging by installing a subordinate
that knows how to push rotated logs to Swift. It would be much nicer
to set this at the model level, and have tools available for the charm
to push rotated logs or stream live logs to the desired logging
service. syslog would be a common approach, as would streaming stdout
or stderr.
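With a model-level logging service, the charm's side of it could be as
small as this sketch (the 'log-sink' hook tool is hypothetical, in the
spirit of the proposal; the rsyslog config path and restart are the
Ubuntu defaults):

    import subprocess

    def configure_log_forwarding():
        """Point rsyslog at the logging service attached to the model.

        'log-sink' is a hypothetical hook tool in the spirit of the
        proposal above; today a subordinate has to supply this address.
        """
        endpoint = subprocess.check_output(['log-sink']).decode().strip()
        if not endpoint:
            return  # no logging service attached to the model
        # '@@host:port' forwards over TCP; a single '@' would be UDP.
        with open('/etc/rsyslog.d/60-model-logging.conf', 'w') as f:
            f.write('*.* @@{}\n'.format(endpoint))
        subprocess.check_call(['systemctl', 'restart', 'rsyslog'])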
And metrics, where a charm installs a cron job or daemon to spit out
performance metrics as JSON dicts to a charm tool, which sends them on
to the desired data store and graphing systems, maybe once a day or
maybe several times a second, rather than the current approach of
assuming statsd as the protocol and spitting out packets to an IP
address pulled from the service configuration.
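A cron-driven emitter would then be trivial, something like the
following sketch ('metrics-send' is the hypothetical charm tool, and
the metrics themselves are just stdlib examples):

    #!/usr/bin/env python3
    """Cron-driven metrics emitter in the spirit of the proposal above.

    'metrics-send' is a hypothetical charm tool that would forward the
    JSON to whatever data store is attached to the model; today this
    would be a statsd packet aimed at an IP taken from the config.
    """
    import json
    import os
    import subprocess

    def emit():
        st = os.statvfs('/')
        metrics = {
            'load1': os.getloadavg()[0],
            'root_disk_used_percent':
                100.0 * (1 - st.f_bavail / float(st.f_blocks)),
        }
        subprocess.check_call(['metrics-send', json.dumps(metrics)])

    if __name__ == '__main__':
        emit()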
> * modelling individual services (i.e. each database exported by the db
> application)
> * rich status (properties of those services and the application itself)
> * config schemas and validation
> * relation config
>
> There is also interest in being able to invoke actions across a relation
> when the relation interface declares them. This would allow, for example, a
> benchmark operator charm to trigger benchmarks through a relation rather
> than having the operator do it manually.
This is interesting. You can sort of do this already if you set up ssh
so units can run commands on each other, but network partitions are an
issue. Triggering an action and waiting on the result works around
this problem.
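Today that round trip has to be driven from outside the model, e.g. by
an operator or test harness doing roughly the following (the CLI
output formats parsed here are from memory and should be treated as
assumptions, and the 'benchmark' action name is made up):

    import json
    import subprocess
    import time

    def run_action_and_wait(unit, action, timeout=600):
        """Trigger an action on a unit and block until it finishes.

        This is the out-of-model round trip the proposal would let one
        charm drive directly over a relation instead.
        """
        out = subprocess.check_output(
            ['juju', 'run-action', unit, action]).decode('utf-8')
        # Assumed output: "Action queued with id: <uuid>"
        action_id = out.strip().rsplit(' ', 1)[-1]
        deadline = time.time() + timeout
        while time.time() < deadline:
            result = json.loads(subprocess.check_output(
                ['juju', 'show-action-output', action_id,
                 '--format=json']).decode('utf-8'))
            if result.get('status') in ('completed', 'failed'):
                return result
            time.sleep(5)
        raise RuntimeError('timed out waiting for %s on %s' % (action, unit))

    # e.g. run_action_and_wait('siege/0', 'benchmark')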
For failover in the PostgreSQL charm, I currently need to leave
requests in the leader settings and wait for units to perform the
requested tasks and report their results using the peer relation. It
might be easier to coordinate if the leader were able to trigger these
tasks directly on the other units.
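The current dance, heavily simplified, looks something like this (not
the actual PostgreSQL charm code; the peer relation name, settings
keys and the elided work are made up for illustration):

    from charmhelpers.core import hookenv

    def request_failover(new_master):
        """Leader: publish a request for the peers to act on."""
        if hookenv.is_leader():
            hookenv.leader_set({'failover-request': new_master})

    def follow_new_master(unit):
        """Re-point replication at the new master (charm-specific, elided)."""

    def leader_settings_changed():
        """Peer: notice the request, do the work, then report back over
        the peer relation so the leader can see every unit has finished."""
        target = hookenv.leader_get('failover-request')
        if not target:
            return
        follow_new_master(target)
        # 'replication' and 'followed' are made-up names for illustration.
        for rid in hookenv.relation_ids('replication'):
            hookenv.relation_set(relation_id=rid,
                                 relation_settings={'followed': target})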
Similarly, most use cases for charmhelpers.coordinator or the
coordinator layer would become easier. Rather than using several
rounds of leadership and peer relation hooks to perform a rolling
restart or rolling upgrade, the leader could trigger the operations
remotely one at a time via a peer relation.
> Storage
>
> * shared filesystems (NFS, GlusterFS, CephFS, LXD bind-mounts)
> * object storage abstraction (probably just mapping to S3-compatible APIs)
>
> I'm interested in feedback on the operations aspects of storage. For
> example, whether it would be helpful to provide lifecycle management for
> storage being re-assigned (e.g. launch a new database application but reuse
> block devices previously bound to an old database instance). Also, I think
> the intersection of storage modelling and MAAS hasn't really been explored,
> and since we see a lot of interest in the use of charms to deploy
> software-defined storage solutions, this probably will need thinking and
> work.
Reusing an old mount on a new unit is a common use case. Single-unit
PostgreSQL is the simplest case here - it detects that an existing
database is on the mount and, rather than recreating it, fixes
permissions (UIDs and GIDs will often not match), mounts it and
recreates any resources the charm needs (such as the 'nagios' user, so
the monitoring checks work). But if you deploy multiple PostgreSQL
units reusing old mounts, what do you do? At the moment, the one lucky
enough to be elected master gets used and the others are destroyed.
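The permission fix-up itself is mundane but easy to get wrong; it
boils down to something like this simplified sketch (paths and
usernames are the Debian/Ubuntu defaults, not the charm's actual
code):

    import grp
    import os
    import pwd

    def adopt_existing_datadir(path='/var/lib/postgresql'):
        """Re-own an existing data directory for this unit's postgres user.

        On a freshly provisioned machine the postgres UID/GID rarely
        match the ones baked into the reused block device, so chown
        everything rather than trusting the numeric IDs on disk.
        """
        uid = pwd.getpwnam('postgres').pw_uid
        gid = grp.getgrnam('postgres').gr_gid
        for root, dirs, files in os.walk(path):
            for name in dirs + files:
                os.lchown(os.path.join(root, name), uid, gid)
        os.chown(path, uid, gid)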
Cassandra is problematic, as the newly provisioned units will have
different positions and ranges in the replication ring, and the
existing data will usually belong to other units in the service. It
would be simpler to create a new cluster, then attach the old data as
an 'import' mount and have the storage hook load it into the cluster.
That requires twice the disk space, but it means you could migrate a
10-unit Cassandra cluster to a new 5-unit Cassandra cluster. (The
charm doesn't actually do this yet; this is just speculation on how it
could be done.) I imagine other services such as OpenStack Swift would
be in the same boat.
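If Cassandra went the 'import' mount route, the load step would
presumably be a thin wrapper around sstableloader, roughly like this
(speculative, matching the speculation above; the mount path is made
up):

    import os
    import subprocess

    def import_old_data(import_mount='/srv/cassandra-import'):
        """Stream sstables from an attached 'import' mount into the new
        ring. sstableloader redistributes rows to whichever units now
        own each token range, which is exactly what is needed when the
        old and new clusters are different sizes."""
        data_root = os.path.join(import_mount, 'data')
        for keyspace in os.listdir(data_root):
            if keyspace.startswith('system'):
                continue  # skip the old cluster's system keyspaces
            ks_dir = os.path.join(data_root, keyspace)
            for table in os.listdir(ks_dir):
                subprocess.check_call([
                    'sstableloader', '-d', 'localhost',
                    os.path.join(ks_dir, table)])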
--
Stuart Bishop <stuart.bishop at canonical.com>