Better handling of MongoDB disconnects due to new replicaset members

Mon Aug 15 00:11:19 UTC 2016

Just to round out this thread, this issue has now been dealt with (thanks
Tim!). The server now translates these errors into a more useful error that
indicates the client should retry the request. The Juju command line
client transparently intercepts these errors and retries.

Here's the relevant pull request: https://github.com/juju/juju/pull/5927
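
For readers who want the shape of the approach without reading the pull
request, here is a minimal sketch of the two halves: the server turns a
transient database failure into a "please retry" error, and the client
re-issues the request when it sees one. This is not the code from the pull
request. The names RetryableError, translate_error and call_with_retry, the
use of Python's TimeoutError to stand in for the i/o timeout, and the
attempt and delay values are all assumptions made for illustration.

```python
import time


class RetryableError(Exception):
    """Hypothetical API error meaning "please retry the request"."""


def translate_error(err):
    """Server side: map a transient database failure to RetryableError.

    Assumption: the dropped MongoDB connection surfaces as a timeout-style
    exception. Any other error is passed through unchanged.
    """
    if isinstance(err, TimeoutError):
        return RetryableError(str(err))
    return err


def call_with_retry(request, attempts=5, delay=0.1, sleep=time.sleep):
    """Client side: re-issue the request while the server asks for a retry.

    `request` is any zero-argument callable that performs one API call.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except RetryableError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the user
            sleep(delay)
```

Keeping the translation on the server means the client only needs to know
about one error type, not about MongoDB failure modes. The retry count is
bounded so that a controller that never recovers still produces an error
instead of hanging the command.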

On 26 July 2016 at 14:42, Reed O'Brien <reed.obrien at canonical.com> wrote:

> On Mon, Jul 25, 2016 at 5:38 PM, Menno Smits <menno.smits at canonical.com>
> wrote:
>
>> Regarding https://bugs.launchpad.net/juju-core/+bug/1597601 ...
>>
>> When "juju enable-ha" is used, new controller machines are started, each
>> running a mongod instance which is connected to Juju's replicaset. As each
>> new node joins the replicaset a MongoDB leader election is triggered which
>> causes all mongod instances in the replicaset to drop their connections
>> (this is by design). The workers in Juju's machine agents handle this
>> correctly by aborting and restarting with fresh connections to MongoDB.
>>
>> The problem is that if an API request comes in at just the right time, it
>> will be actioned just as the MongoDB connection goes down, resulting in the
>> i/o timeout error being reported back to the client.
>>
>> This isn't a new problem but it's one that Juju's users regularly run
>> into. A workaround is to wait for the new controller machines to come up
>> after enable-ha is issued before doing anything else.
>>
>> IMHO it would be best if Juju could hide all this from the client as much
>> as possible but I'm really not sure if that's feasible or what the best
>> approach should be.
>>
>> The challenge is that unless we do some major rearchitecting, the API
>> server needs to be restarted when the MongoDB connections drop. There's no
>> way that the client's connection can stay up, making it difficult to
>> hide this detail from the client.
>>
>
> It seems that mgo could handle this as a failover. Or that we could see
> that the replica set is starting and wait until it reports being up, then
> refresh the mgo session. I don't understand why the API server itself has
> to restart, though I am sure there are good reasons.
>
>
>>
>> The most practical solution I can think of is that we introduce a new
>> error type over the API which means "please retry the request". Errors such
>> as an i/o timeout from the MongoDB layer could be converted into this
>> error. Clients would obviously have to handle this error specially.
>>
>
> Barring handling it via mgo session this seems obvious and practical.
>
>
> ~ro
>
> --
> Reed O'Brien
> ✉ reed.obrien at canonical.com
> ✆ 415-562-6797
>
>