Sprint Feedback

Thu Jun 2 17:15:41 UTC 2011

Tom, clearly you've got our attention. Thanks so much for all the
feedback. Comments in-line.

Excerpts from Tom Haddon's message of Thu Jun 02 05:02:11 -0700 2011:
> Dear Ensemble Team,
> 
> == Things we like about Puppet ==
> 
> - Declarative state. This makes it easier to manage services over the
> longer term, because you can be assured that systems are configured the
> way you've told them to be configured.

I'm interested in people taking a shot at writing a formula with Puppet
for this very reason. It may simplify some services in that one won't
need to keep track of what has been done, since Puppet is already good
at that.

> 
> - Clean syntax and very simple to deploy services.
> - Powerful concepts that hold the promise of allowing easy scalability
> of services.
> 
> == Things we don't like about Ensemble ==
> 
> - Ensemble seems to currently require a cloud infrastructure (EC2/S3
> specifically) to run. Are there plans in the future to allow Ensemble to
> run on bare metal? Our usage of EC2 has been limited for a number of
> reasons, including cost and performance. If the plan was to only ever
> have Ensemble work on EC2, that'd make it hard to adopt it for our
> services.

I'm curious if you discussed the possibility of OpenStack-based deployment.
It seems a bit daft to run OpenStack on a box for it to only provide one
"machine" in the form of an LXC container.

However, decoupling "the machine" from "the service" is actually one of
my favorite concepts of Ensemble, though it is quite different from the
traditional paradigm.

Anyway, running against OpenStack should work now (though I believe it's
untested), and would allow running Ensemble-managed services without
EC2 or virtualization. Of course, LXC is broken in lucid at the moment,
so it also pretty much means using Natty. Doh.

> - Can't preview changes before they happen to determine if they will do
> what you want them to do. Can't test out new versions of different
> formulas with different "environments".

$ ensemble deploy
usage: ensemble deploy [-h] --repository REPOSITORY
                       [--environment ENVIRONMENT]
                       formula [service_name]

Each environment is defined in ~/.ensemble/environments.yaml, and can
configure the machine provider, including which hostname to contact for
the EC2/AWS API (allowing segregation by "cloud"). One can also
segregate further by giving each environment different AWS secret keys.
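
As a rough sketch of how that could be used to try a new formula version
somewhere safe first: the snippet below only assembles command lines from
the flags shown in the usage output above. The environment names
("staging", "production") and the "formulas-next" repository directory are
made up for illustration and would have to exist in your own setup.

```python
import shlex


def deploy_command(repository, formula, service_name=None, environment=None):
    """Build an `ensemble deploy` argv using only the flags from the usage text."""
    argv = ["ensemble", "deploy", "--repository", repository]
    if environment is not None:
        argv += ["--environment", environment]
    argv.append(formula)
    if service_name is not None:
        argv.append(service_name)
    return argv


if __name__ == "__main__":
    # Hypothetical names: try the new formula revision in "staging" first,
    # then deploy the known-good revision to "production".
    print(shlex.join(deploy_command("formulas-next", "mysql", "wiki-db", "staging")))
    print(shlex.join(deploy_command("formulas", "mysql", "wiki-db", "production")))
```

The resulting lists can be handed to subprocess.check_call() once the
environments really exist.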

> 
> == Some other comments based on the example formulas ==
> 
> - The "utility instance" seems to be a single point of failure. If this
> goes down do we lose access to everything?
> - Once you've hooked items together, it's confusing to me that the
> "mysql" service is saying its relation is "db: wordpress" - wordpress
> isn't a DB, so shouldn't this be saying "app: wordpress" or "db for:
> wordpress"?

I actually think this makes perfect sense. The example really shouldn't
call the service "wordpress", it should call it "myblog". Then db: myblog
makes a lot more sense. It's *providing* the db for myblog.

> - When you add-unit to the wordpress instance, I don't see how this
> actually provides any scalability. Presumably you'd need to be using
> round robin DNS, or have a load balancer in front of all these
> instances, or something like that?

Check out the mediawiki demo we have in principia for how to integrate
with haproxy.

Set up your AWS credentials and install ensemble the same way you did
for the examples. Then:

bzr branch lp:principia-tools
cd principia-tools
scripts/getall
tests/mediawiki.sh

This will spawn quite a few nodes: 1 mysql db, 2 memcached, 2 mediawiki,
and 1 haproxy, plus the bootstrap node. Once they're all running, the
haproxy node's public IP *should* present you with a mediawiki instance.

You can push its scalability by doing 'ensemble add-unit demo-wiki'. If
you start to push the query capability of the master db server, you can
add a slave with

ensemble deploy --repository=formulas mysql slave-db
ensemble add-relation slave-db:slave wiki-db:master

There's a bit of a disconnect here, as you have to wait for this relation
to be fully up before you can relate slave-db to mediawiki. I'm still
working out if there's a way to do that without manually waiting.

ensemble add-relation demo-wiki:slave slave-db:db
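
The manual wait described above could be scripted with a small polling
helper. This is only a sketch: how "fully up" shows in 'ensemble status'
is not specified here, so relation_is_up() in the usage comment is a
placeholder you would have to write yourself.

```python
import time


def wait_until(check, timeout=600.0, interval=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or `timeout` seconds have passed.

    Returns True if check() succeeded, False if the deadline was reached.
    `sleep` and `clock` are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical usage; relation_is_up() is a placeholder, for example a
# function that runs `ensemble status` and inspects its output.
#
# import subprocess
# if wait_until(relation_is_up):
#     subprocess.check_call(
#         ["ensemble", "add-relation", "demo-wiki:slave", "slave-db:db"])
```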

You can also deploy a munin node to monitor with

ensemble deploy --repository=formulas munin munin-wiki

Then

ensemble add-relation wiki-db munin-wiki
ensemble add-relation slave-db munin-wiki
ensemble add-relation wiki-balancer munin-wiki
ensemble add-relation demo-wiki munin-wiki
ensemble add-relation wiki-cache munin-wiki

All of them should eventually show up at the munin machine's public
IP at /munin. Note that there's a bug in the txzookeeper library that
seems to affect this munin formula when load gets high (t1.micro isn't
actually powerful enough to run munin for all these nodes).

You can see where this is tedious, and Kapil's previously mentioned "policy"
concept is sorely needed, as ideally you'd just be able to set a policy to
just install munin-node on all machines and relate them to the munin machine.
This is something that is easy in config management, because they are built
to model *machines*, but hasn't been correctly modeled in ensemble yet. It
should actually be trivial once we figure out how it should work.
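
Until something like a policy exists, the repetition can at least be
generated. A minimal sketch, using the service names from this demo; it
only prints the commands, and nothing about it is part of ensemble itself.

```python
# Service names as deployed by the mediawiki demo above.
SERVICES = ["wiki-db", "slave-db", "wiki-balancer", "demo-wiki", "wiki-cache"]


def munin_relation_commands(services, munin_service="munin-wiki"):
    """Return one `ensemble add-relation` argv per monitored service."""
    return [["ensemble", "add-relation", service, munin_service]
            for service in services]


if __name__ == "__main__":
    for argv in munin_relation_commands(SERVICES):
        print(" ".join(argv))
```

Each list could be passed to subprocess.check_call() instead of printed.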

> - Can you use your own AMI? Different instance sizes?
> - How do you apply security updates to running instances, etc.?

I think this is something that the agents will eventually handle (I hope),
something like mcollective's agents, where you can ask a particular class
of machines to run the "apply all updates" agent.

For now ssh is the only way to do this. It's pretty easy to get a list of
machines from 'ensemble status', which is basically just yaml.
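
A sketch of that ssh approach follows. It assumes the status output has a
"dns-name:" entry per machine (the sample text below is invented, so check
it against your real 'ensemble status' output) and that the instances
accept logins as the "ubuntu" user.

```python
import re

# Invented sample; the real `ensemble status` layout may differ.
SAMPLE_STATUS = """\
machines:
  0: {dns-name: ec2-50-16-0-1.compute-1.amazonaws.com, instance-id: i-0001}
  1: {dns-name: ec2-50-16-0-2.compute-1.amazonaws.com, instance-id: i-0002}
"""


def machine_hosts(status_text):
    """Extract unique host names from `dns-name:` entries, in order seen."""
    hosts = []
    for host in re.findall(r"dns-name:\s*([A-Za-z0-9.-]+)", status_text):
        if host not in hosts:
            hosts.append(host)
    return hosts


def update_command(host, user="ubuntu"):
    """Build an ssh argv that applies all pending package updates on host."""
    remote = "sudo apt-get update && sudo apt-get -y upgrade"
    return ["ssh", "%s@%s" % (user, host), remote]


if __name__ == "__main__":
    for host in machine_hosts(SAMPLE_STATUS):
        print(update_command(host))
```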

> - Shouldn't the formulas include author info in the yaml? I'd be loath
> to create my own formulas based on those someone else has provided
> unless I know who I can go to if I have problems with the formula. Also,
> is there any promise of version compatibility, or is it possible that if
> you create formulas that import other formulas that your own formula
> will no longer work?

Formulas are going to be quite tied to revision control. Right now you
can tell who the authors of the principia formulas are by running 'bzr log'
on their branches. I do see that having a responsible party listed in the
YAML would be helpful though.

> - Can it use elastic IPs (DNS and for interacting with "static"
> services)? Can it interact with services that are not part of Ensemble
> (i.e. DB servers that are in a DC rather than in EC2, or servers that
> you don't want to run with Ensemble for some other reason)?

We did put on the road map the concept of a "virtual service" which is
outside of ensemble and just exposes the config details for sending
externally. At the cost of an EC2 machine, you can of course write a
formula which simply relates to the service you're interested in, and
then communicates those details to the external systems.

> - What security is there in terms of if one server in an ensemble
> cluster is compromised? How much information is shared between the
> instances with zookeeper and what's to prevent one server from querying
> all information on other servers?
> - What is the Ensemble approach to firewalls? Is it expected that this
> is a formula issue?

I did at one point add firewalling to the memcached formula, since
memcached wasn't configured for SASL so it would be exposed to all
machines in the same Amazon security group. But I took it out as it made
the formula more complex. I do think eventually it will be in ensemble,
as will all the IP sharing that I've had to manually implement in formulas
by parsing ifconfig.

> - It's not entirely clear to me if you could use Ensemble to replace our
> current deployment scripts - they are used to push out incremental code
> updates to specific services, and work by copying code into a directory
> that includes a unique identifying string (usually the bzr revision of
> the code in question), bringing the service down, checking it's down,
> switching the symlink for the code directory we're expecting to find the
> active code in to the directory we've previously pushed to, and then
> restarting services, and then checking the service is up. This can be
> done in parallel or serial, or a combination of both (groups of servers
> serially, each group in parallel). We can also add in custom hooks to do
> things like "set read-only mode" for a given service fairly trivially.

This would fit nicely in the upgrade-formula hook. We talked at one time
about having arguments for upgrade-formula which were like '--rolling' or
'--parallel' to control whether nodes were all done at once or in serial.
I'm not sure how it's done now though.
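
To make the "groups serially, each group in parallel" idea concrete, here
is a small sketch. It is not how ensemble does it; upgrade_one stands in
for whatever actually upgrades a single unit, and the function names are
invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(units, size):
    """Split units into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [units[i:i + size] for i in range(0, len(units), size)]


def upgrade_in_batches(units, upgrade_one, batch_size):
    """Upgrade groups one after another, units within a group in parallel.

    batch_size=1 gives a fully rolling upgrade; batch_size=len(units)
    upgrades everything at once. If any unit in a group fails, the
    exception propagates and later groups are never started.
    """
    done = []
    for group in batches(units, batch_size):
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            # list() consumes every result, re-raising the first failure.
            list(pool.map(upgrade_one, group))
        done.extend(group)
    return done
```

A "set read-only mode" step of the kind Tom mentions would fit naturally
before and after the loop.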

> 
> == What's next? ==
> 
> Our plans from here are to continue testing Ensemble so that we can try
> to realistically get an idea of what works for us and what doesn't over
> the long term. Initially this involves testing how it deals with a bunch
> of error states, but then we'd also like to begin writing some formulas
> (I guess participating in https://launchpad.net/principia would be the
> best thing here).

Yes please!

> 
> I think the overall takeaway as far as we can see is that Ensemble seems
> suited for deploying services, but not necessarily managing services. Is
> the idea that you would need to deploy your own management layer through
> Ensemble, or outside of Ensemble, or is the idea that in the future
> Ensemble will be able to manage services for you?

As Gustavo said, its whole purpose is managing services. Getting deployment
right means that implementing long term management *should* be simple.

Also I want to call attention to how small all of the hook scripts are.

The most complicated ones are < 100 lines of python, and even then a
lot of that is inline templates. I wrote a few pieces in PHP to show how
simple it can be (and to make it simpler to integrate with a PHP web app).

I know that Puppet modules aren't towers of code complexity. But learning
Puppet's DSL to the degree where you can be comfortable with exported
configs and resources, or learning enough Ruby to do these types of
deployments in Chef, usually means getting out of your comfort zone for
a while. Ensemble is trying to address this friction by saying "we will
run this command, at this point, and respond to these commands in this
way". Any language or methodology is appropriate in this model, which
should make it really easy to share and enhance formulas around their
respective services in a way that encourages collaboration.
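
For a sense of scale, the core of a hook with an inline template can be
this small. This is a sketch, not one of the principia hooks: the setting
names (database, user, password, host) are assumptions, and in a real hook
they would be read from the relation rather than passed in by hand.

```python
from string import Template

# Inline template, in the spirit of the hooks described above.
WP_CONFIG = Template("""<?php
define('DB_NAME', '$database');
define('DB_USER', '$user');
define('DB_PASSWORD', '$password');
define('DB_HOST', '$host');
""")


def render_config(settings):
    """Fill the template; raises KeyError if a required setting is missing."""
    return WP_CONFIG.substitute(settings)


if __name__ == "__main__":
    # Example values only.
    print(render_config({
        "database": "myblog",
        "user": "myblog",
        "password": "s3cret",
        "host": "10.0.0.5",
    }))
```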