Evaluating Juju for use in a large system

Sat Sep 1 02:48:57 UTC 2012

Hello,

I'm working on a project which requires robust and simple-to-use
service deployment, configuration, and coordination functionality.
This appears to be an area where Juju will excel (and, to some extent,
already does).

I am trying to decide whether I should recommend that the project move
forward and be developed on top of Juju. Note, this decision has to be
made quite soon, so I really need to be able to gauge the state of
Juju at this point in time. If I can find answers to the following
questions then it will make my decision process much easier.

1) The project I'm working on is going to be largely Python based, so
the fact that Juju is also written in Python is a big win. Could
someone elaborate on the rationale for switching to Go? I have seen
the presentation from Gustavo at Google I/O where he briefly mentions
error handling and the pitfalls of Twisted's callback model; however,
more detail on the subject would be appreciated.

2) Following on from the last question, when is the Go version going
to be out of development and considered ready for use and/or
production? Will there be a production release of the Python version?

When I refer to "service instance" below I mean the actual service
(E.g., wordpress, mysql) running inside the service unit.

When testing Juju I came across a couple of behaviours I didn't
understand, and I would like to know whether there is a way around
them or a plan to fix them. If there is a plan to fix them, an
approximate timeline would be much appreciated.

3) I ran some tests deploying a RabbitMQ service with multiple units
(using the charm from `bzr branch lp:charms/rabbitmq-server`). One of
the tests involved running deploy followed by two `juju add-unit`
commands back to back. The first two units spawned fine; however, the
last received an empty Erlang cookie value, and as a result its
RabbitMQ service instance was unable to start. I'm wondering whether
there are any open issues around race conditions with deploy/add-unit.

Note: now that I look more closely at the rabbitmq-server charm, I'm
surprised this was even possible, since it checks whether the cookie
value is empty.

This was the output of `juju status` after the failure (running
`juju resolved` and `juju resolved --retry` on the unit did not
help):

      rabbitmq/2:
        agent-state: started
        machine: 0
        public-address: 192.168.122.25
        relation-errors:
          cluster:
          - rabbitmq

4) Is there a way to ensure that service instances can complete their
clustering operations with their peers before the dependent relations
are notified of their presence? If not, is this something which is
planned to be supported? I did stumble across this post
(https://lists.ubuntu.com/archives/juju/2012-February/001258.html)
which seems to touch on the subject, but I'm not sure what the outcome
was. I ran into this when I was testing deploying and scaling a
RabbitMQ cluster. I found that when another application depended on
RabbitMQ, it would be notified as soon as a new RabbitMQ service unit
was added, even before RabbitMQ had a chance to cluster.
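
To make the ordering I am after concrete, here is a rough Python
sketch of the gating logic. It is purely illustrative: the function
names and the `expected_peers` input are made up for this example and
are not part of any Juju API. The idea is that a unit exposes nothing
to dependent services until it has clustered with its peers.

```python
def should_publish(clustered_peers, expected_peers):
    """Return True once this unit has clustered with every expected peer.

    Both arguments are collections of unit names. They are hypothetical
    inputs; a real charm would have to derive them from its peer relation.
    """
    return set(expected_peers) <= set(clustered_peers)


def settings_for_clients(hostname, clustered_peers, expected_peers):
    """Build the settings a unit would expose to dependent services.

    Returns an empty dict while clustering is incomplete, so dependent
    services have nothing to act on until the unit is actually ready.
    """
    if not should_publish(clustered_peers, expected_peers):
        return {}
    return {"hostname": hostname, "ready": "true"}
```

With this shape, a unit that has joined only `rabbitmq/0` out of an
expected `rabbitmq/0` and `rabbitmq/1` would publish nothing yet.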

During testing I also came across a few things which I thought would
make useful features. I'm wondering whether any of these are on the
roadmap and, if so, whether they will be included in the Python code
base.

5) It would be nice if there were a graceful shutdown mechanism so
that the service instance could be notified and tidy up before the
unit is destroyed.

6) Many services can benefit from having a "locked" state whereby they
will allow pending request processing to finish but not accept any
more requests. Is there any plan to expose this sort of mechanism to
the services? Note, it would mainly be used to "lock" and "unlock"
individual service instances so I imagine the command would need to be
targeted at service units.
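
As a sketch of the behaviour I have in mind on the service side, here
is a small Python example. The `DrainGate` class and its method names
are hypothetical and only illustrate the "lock" semantics: stop
admitting new requests, but let pending ones finish. How Juju would
deliver the lock/unlock command to a unit is exactly the open
question.

```python
import threading


class DrainGate:
    """Tracks in-flight requests and refuses new ones while locked."""

    def __init__(self):
        self._cond = threading.Condition()
        self._locked = False
        self._in_flight = 0

    def try_begin(self):
        """Admit a request unless the instance is locked."""
        with self._cond:
            if self._locked:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one previously admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def lock(self, timeout=None):
        """Stop admitting requests and wait for pending ones to finish.

        Returns True if all pending requests finished within the timeout.
        The gate stays locked either way, until unlock() is called.
        """
        with self._cond:
            self._locked = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)

    def unlock(self):
        """Start admitting requests again."""
        with self._cond:
            self._locked = False
```

A request handler would call `try_begin()` before doing any work and
`end()` when done; an operator-triggered "lock" would call `lock()`
and report whether the instance drained in time.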

7) When it comes to destructive operations like remove-unit,
destroy-service, etc. it might be nice if there was an acknowledgment
system which would allow service instances to nack if the operation
was considered invalid/dangerous, e.g., removing units such that there
would no longer be enough to satisfy a replication factor. Of course,
there would still be a --force option available. Is this type of
feature on the roadmap?
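
For illustration, the check a service might run before acknowledging a
removal could look something like the Python sketch below. The
function name, its parameters, and the reason strings are all
assumptions of mine; this is not an existing Juju hook or command.

```python
def review_removal(current_units, units_to_remove, replication_factor,
                   force=False):
    """Return (ack, reason) for a proposed unit removal.

    ack is False (a "nack") when the removal would leave fewer units
    than the replication factor requires, unless force is set.
    """
    remaining = current_units - units_to_remove
    if remaining < 0:
        # Not even --force can remove units that do not exist.
        return False, "cannot remove more units than exist"
    if force:
        return True, "forced"
    if remaining < replication_factor:
        return False, "only %d unit(s) would remain; replication factor is %d" % (
            remaining, replication_factor)
    return True, "ok"
```

For example, removing one unit from a three-unit service with a
replication factor of 3 would be nacked, while the same request with
force set would go through.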

Again, Juju looks like a great project, and I hope I can use it as a
key building block in a large scale system. If I can find answers to
these questions it would be an excellent first step!

Thanks,
-Torin