Mongo experts - help need please
Gustavo Niemeyer
gustavo.niemeyer at canonical.com
Fri Jul 25 14:21:34 UTC 2014
On Fri, Jul 25, 2014 at 5:29 AM, Stuart Bishop
<stuart.bishop at canonical.com> wrote:
> On 25 July 2014 12:05, Gustavo Niemeyer <gustavo.niemeyer at canonical.com> wrote:
> The bug Ian cites and is trying to work around has sessions failing
> with an i/o error after some time (I'm guessing resource starvation in
> MongoDB or TCP networking issues). session.Copy() is pulling things
> from a pool, so it might be handing out sessions doomed to fail with
> exactly the same issue. The connections in the pool could even be
> perfectly functional when they went in, with no way at the Go level of
> knowing they have failed without trying them.
That's not actually the bug Ian is asking about in this thread.
The reason why the timeouts happen is well understood: MongoDB has a
fixed timeout of 10 minutes, and mgo right now does not concurrently
ping a socket that was reserved for a session. Using a single session
forever and never calling Refresh on it will surely time out if it
stays unused for that long.
The solution is simple: call Refresh at a control point (where that is
depends on the application shape) or Close a copy of the session and
let the pool internally deal with it, and do handle any errors when
they happen.
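
As a rough illustration of the "Refresh at a control point" option, here is
a minimal sketch. It assumes only that the session exposes Ping and Refresh,
as *mgo.Session does. The session interface, fakeSession and checkpoint are
hypothetical names used so the example runs without a MongoDB server; they
are not part of mgo.

```go
package main

import (
	"errors"
	"fmt"
)

// session is the subset of *mgo.Session used in this sketch.
// Both methods exist on the real type with these signatures.
type session interface {
	Ping() error
	Refresh()
}

// fakeSession is a hypothetical stand-in for a real session.
type fakeSession struct {
	stale bool // simulates a reserved socket that timed out while idle
}

// Ping fails while the simulated socket is stale.
func (s *fakeSession) Ping() error {
	if s.stale {
		return errors.New("EOF")
	}
	return nil
}

// Refresh simulates dropping the reserved socket so that the next
// operation picks up a fresh one.
func (s *fakeSession) Refresh() { s.stale = false }

// checkpoint is a hypothetical control point, for example the top of a
// worker loop: refresh the session first, then check that it is usable
// and report any error to the caller.
func checkpoint(s session) error {
	s.Refresh()
	return s.Ping()
}

func main() {
	s := &fakeSession{stale: true}
	fmt.Println("before refresh:", s.Ping())
	fmt.Println("after checkpoint:", checkpoint(s))
}
```

Where the control point belongs depends on the shape of the application;
the Ping after Refresh is only there to surface a server that is really
down.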
> If this is the case, then Ian would need to handle the failure by
> ensuring the failed connection does not go back in the pool and
> grabbing a new one (the deferred Close() will return it I think). And
> repeating until it works, or until the pool has been exhausted and we
> know Mongo is actually down rather than just having a polluted pool.
There's no reason to do that. The pool can deal with connection errors
and timeouts, and collects bad sockets appropriately.
Trying to ensure a bad socket never comes out of the pool is also a
bad path. It's impossible to guarantee that a socket obtained from mgo
or any other database driver is indeed in perfect state. Failures can
happen the nanosecond after any tests are made. The reliable way is to
handle errors appropriately, fall back to a sane path, and retry from
there.
gustavo @ http://niemeyer.net