Better handling of MongoDB disconnects due to new replicaset members

Tue Jul 26 00:38:14 UTC 2016

Regarding https://bugs.launchpad.net/juju-core/+bug/1597601 ...

When "juju enable-ha" is used, new controller machines are started, each
running a mongod instance which is connected to Juju's replicaset. As each
new node joins the replicaset a MongoDB leader election is triggered which
causes all mongod instances in the replicaset to drop their connections
(this is by design). The workers in Juju's machine agents handle this
correctly by aborting and restarting with fresh connections to MongoDB.

The problem is that if an API request comes in at just the right time, it
will be actioned just as the MongoDB connection goes down, resulting in the
i/o timeout error being reported back to the client.

This isn't a new problem but it's one that Juju's users regularly run
into. A workaround is to wait for the new controller machines to come up
after enable-ha is issued before doing anything else.

IMHO it would be best if Juju could hide all this from the client as much
as possible but I'm really not sure if that's feasible or what the best
approach should be.

The challenge is that unless we do some major rearchitecting, the API
server needs to be restarted when the MongoDB connections drop. There's no
way that the client's connection can stay up, making it difficult to
hide this detail from the client.

The most practical solution I can think of is that we introduce a new error
type over the API which means "please retry the request". Errors such as an
i/o timeout from the MongoDB layer could be converted into this error.
Clients would obviously have to handle this error specially.

Does anyone have another idea?

- Menno