Evaluating Juju for use in a large system

Sat Sep 1 14:13:00 UTC 2012

Excerpts from Torin Sandall's message of 2012-08-31 19:48:57 -0700:
> Hello,
> 
> I'm working on a project which requires robust and simple-to-use
> service deployment, configuration, and coordination functionality.
> This appears to be an area where Juju will excel (and already does to
> some extent.)
> 

First, welcome!

We think Juju will excel in such systems as well. :)

> I am trying to decide whether I should recommend that the project move
> forward and be developed on top of Juju. Note, this decision has to be
> made quite soon, so I really need to be able to gauge the state of
> Juju at this point in time. If I can find answers to the following
> questions then it will make my decision process much easier.
> 
> 1) The project I'm working on is going to be largely Python based, so
> the fact that Juju is also written in Python is a big win. Could
> someone elaborate on the rationale for switching to Go? I have seen
> the presentation from Gustavo at Google I/O where he briefly mentions
> error handling and pitfalls of Twisted's callback model, however, some
> more details on the subject would be appreciated.
> 

The reasons Gustavo gave in that talk are the reasons for the language
choice. I'd like to think that users of Juju won't need to hack on it
very much, but if you do, I think Go is a pretty straightforward language.

> 2) Following on the last question, when is the Go version going to be
> out of development and considered ready for use and/or production?
> Will there be a production release of the Python version?
> 

You can follow the development of the Go version here:

https://launchpad.net/juju-core

Version 2.0 is due in early October and should achieve feature parity
with the Python version (on EC2 only, though).

The Python version is receiving maintenance and important bug fixes.
It's used in production in some places, though I always counsel people
to check these two lists and have workarounds for these issues in mind
when deciding to deploy it:

https://bugs.launchpad.net/juju/+bugs?field.tag=security
https://bugs.launchpad.net/juju/+bugs?field.tag=production

Many of these bugs are being fixed, though I can't make any promises
that they will *all* be fixed.

> When I refer to "service instance" below I mean the actual service
> (E.g., wordpress, mysql) running inside the service unit.
> 
> When testing Juju I came across a couple behaviours I didn't
> understand and would like to know if there is a way around them or if
> there's a plan to fix them. If there's a plan to fix them then an
> approximate timeline would be much appreciated.
> 
> 3) I performed some tests with deploying a RabbitMQ service with
> multiple units (using the charm from `bzr branch
> lp:charms/rabbitmq-server`.) One of the tests I executed involved
> running deploy followed by two `juju add-unit` commands back to back.
> The first two units spawned fine however the last received an empty
> value for the Erlang cookie value and as such the RabbitMQ service
> instance was unable to start. I'm wondering if there are any open
> issues around race conditions with deploy/add-unit.
> 

There are issues with peer relations. They are fine for building a list of
the members of the cluster, but it's impossible to predict the order in
which units join, and there is no "leader election" support, so the way
the cookie generation happens is not reliable when adding units in parallel.

It might work better if you waited for the first unit before adding more
units. There's a command in the 'juju-jitsu' package (only in the latest
quantal or the Juju PPA) called 'watch' that will help with this kind
of logic. Note that juju-jitsu lags development and is very experimental.
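
Without jitsu, the same "wait, then add" logic can be approximated by
polling `juju status`. The following is only a rough sketch: the helper
names and the unit name are made up for illustration, and the parser
assumes the indented layout shown in the status output quoted further
down. Note that "agent-state: started" only says the unit agent is up;
it does not prove the cluster hooks have finished.

```python
import subprocess
import time


def unit_agent_state(status_text, unit_name):
    """Return the agent-state listed under unit_name, or None if absent."""
    unit_indent = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if unit_indent is None:
            # Still looking for the "<unit_name>:" key.
            if stripped == unit_name + ":":
                unit_indent = indent
            continue
        if indent <= unit_indent:
            # Left the unit's block without seeing an agent-state.
            return None
        if stripped.startswith("agent-state:"):
            return stripped.split(":", 1)[1].strip()
    return None


def wait_for_unit(unit_name, timeout=600, interval=10):
    """Poll `juju status` until unit_name reports agent-state: started."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = subprocess.check_output(["juju", "status"]).decode()
        if unit_agent_state(status, unit_name) == "started":
            return True
        time.sleep(interval)
    return False
```

A deployment script could call wait_for_unit("rabbitmq/0") after the
deploy and only then run the `juju add-unit` commands, one at a time.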

Either way, we're just now adding explicit test support to the official
charm store. It sounds like a 3-node cluster would be a good test to
run and get passing.

> >>>Now that I look closer at the rabbitmq-server charm, I'm surprised this was even possible since it checks to see if the cookie value is empty.
> 
> This was the output of `juju status` after the failure happened (I
> tried running `juju resolved` and `juju resolved --retry` on it
> without any luck.):
> 
>       rabbitmq/2:
>         agent-state: started
>         machine: 0
>         public-address: 192.168.122.25
>         relation-errors:
>           cluster:
>           - rabbitmq
> 
> 4) Is there a way to ensure that service instances will be able to
> perform their clustering operations with their peers before the
> dependant relations are notified about their presence? If not, is this
> something which is planned to be supported? I did stumble across this
> post (https://lists.ubuntu.com/archives/juju/2012-February/001258.html)
> which seems to touch on the subject, but I'm not sure what the outcome
> was. I ran into this when I was trying to test deploying and scaling a
> RabbitMQ cluster. I found that when I had another application which
> depended on RabbitMQ, the other application would be notified as soon
> as new RabbitMQ service unit was added, even before RabbitMQ had a
> chance to cluster.
> 

This is definitely more complicated than it needs to be right now. One
method we could use is to have the cluster relation loop through the
amqp relations that have been established and do something like this:

relation-set ... clustered=1

That would allow apps to require clustering. But it's hard to get this
right in a generic way... need to think more about this one.
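
To make the idea concrete, here is a rough sketch of both sides in
Python. It is not what the rabbitmq-server charm does today. It assumes
a Juju version whose hook tools include `relation-ids` and accept
`relation-set -r <relation-id>`; the function names are made up, and
"clustered" is just the key from the example above.

```python
import subprocess


def clustered_commands(relation_ids):
    """Build one relation-set invocation per established amqp relation."""
    return [["relation-set", "-r", rid, "clustered=1"] for rid in relation_ids]


def announce_clustered():
    """Provider side: call from the cluster hook once clustering succeeded."""
    out = subprocess.check_output(["relation-ids", "amqp"]).decode()
    for cmd in clustered_commands(out.split()):
        subprocess.check_call(cmd)


def is_clustered(raw_value):
    """Consumer side: interpret the output of `relation-get clustered`."""
    return raw_value.strip() == "1"
```

The consuming charm's amqp-relation-changed hook would run
`relation-get clustered`, pass the output to is_clustered(), and simply
exit without configuring anything until it returns True. The hook fires
again when the setting changes.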

> There were some other things which I came across during testing which
> I thought would be useful features. I'm wondering if any of these are
> on the roadmap. If they are, will they be included in the Python code
> base?
> 

The bug list is long, and there are a lot of features to consider for
the future. However, it's unlikely features will be implemented in the
Python version as a priority.

> 5) It would be nice if there was a graceful shutdown mechanism so that
> the service instance could be notified and tidy up before the unit is
> destroyed.
> 

https://bugs.launchpad.net/juju/+bug/872264
https://bugs.launchpad.net/juju/+bug/932269

> 6) Many services can benefit from having a "locked" state whereby they
> will allow pending request processing to finish but not accept any
> more requests. Is there any plan to expose this sort of mechanism to
> the services? Note, it would mainly be used to "lock" and "unlock"
> individual service instances so I imagine the command would need to be
> targeted at service units.
> 

This could be implemented today in service configs. For unit-specific
things I usually recommend just using 'juju ssh'. In theory, the 'stop'
hook should handle this need when juju needs to stop the service, with
the start hook reversing its effects.
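
As a rough sketch of the service-config approach: a config-changed hook
could read a string option and drain or resume the service accordingly.
The option name "mode", its values, and the drain/resume callbacks are
all hypothetical; a real charm would have to declare the option in its
config.yaml and supply the service-specific logic. Being service
config, this applies to every unit of the service, not to a single one.

```python
import subprocess

VALID_MODES = ("unlocked", "locked")


def parse_mode(raw_value):
    """Normalise the option value; anything unrecognised means unlocked."""
    mode = raw_value.strip().lower()
    return mode if mode in VALID_MODES else "unlocked"


def config_changed(drain, resume):
    """Body of a config-changed hook; drain and resume are charm-specific."""
    raw = subprocess.check_output(["config-get", "mode"]).decode()
    if parse_mode(raw) == "locked":
        drain()   # finish pending requests, refuse new ones
    else:
        resume()  # accept requests again
```

An operator would then flip the state with something like
`juju set <service> mode=locked`. Falling back to "unlocked" on an
unknown value is a design choice, so that a typo doesn't take the
service out of rotation.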

> 7) When it comes to destructive operations like remove-unit,
> destroy-service, etc. it might be nice if there was an acknowledgment
> system which would allow service instances to nack if the operation
> was considered invalid/dangerous, e.g., removing units such that there
> would no longer be enough to satisfy a replication factor. Of course,
> there would still be a --force option available. Is this type of
> feature on the roadmap?
> 

Right now, only 'terminate-machine' and 'destroy-environment'
will actually remove machines from your environment, and thus remove
data. remove-unit and destroy-service simply remove their definition
from Juju and cause the units on the other side of each relation to
have their departed/broken hooks called. So the charms consuming a
service need to make sure they don't cancel any important operations
when they get departed/broken, but that should be sufficient.

This bug is roughly about simplifying that:

https://bugs.launchpad.net/juju/+bug/862422

It's marked as "Medium", so there are no plans for the immediate future.

> Again, Juju looks like a great project, and I hope I can use it as a
> key building block in a large scale system. If I can find answers to
> these questions it would be an excellent first step!
> 

It's great to have your feedback, Torin. Please let us know if there is
anything else we can do to make your decision easier. :)