Mongo experts - help needed please

Gustavo Niemeyer gustavo.niemeyer at canonical.com
Fri Jul 25 05:05:45 UTC 2014


On Fri, Jul 25, 2014 at 1:02 AM, Ian Booth <ian.booth at canonical.com> wrote:
> We've transitioned to using Session.Copy() to address the situation whereby Juju
> would create a mongo collection instance and then continue to make db calls
> against that collection without realising the underlying socket may have become
> disconnected. This resulted in Juju components failing, logging "i/o timeout"
> errors talking to mongo, even though mongo itself was still up and running.

Sounds sane, as I indicated in discussions of this topic over the
last two weeks, and also about a year ago when we first covered it.
Serializing every single request to a concurrent server via a single
database connection seems like a pretty bad idea for anything but
simplistic servers.

> As an aside - I'm wondering whether the mgo driver shouldn't transparently catch
> an i/o error associated with a dead socket and retry using a fresh connection
> rather than imposing that responsibility on the caller?

The evidence so far indicates that this will likely not happen. The
current design was purposefully put in place so that harsh connection
errors are not swept under the rug, and this seems to be working well
so far. I'd rather not have juju proceed past a harsh problem such
as a master re-election midway through the execution of an algorithm
without any indication that the failure has happened, let alone have
it silently retry operations that in most cases are not idempotent.

That said, the goal is of course not to make the developer's life
miserable. All the driver wants is an acknowledgement that the error
was perceived and taken care of. This is done trivially by calling:

    session.Refresh()

Done. The driver will happily drop the error notice, and proceed with
further operations, blocking if waiting for a re-election to take
place is necessary.
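
To make the acknowledgement concrete, here is a minimal sketch in Go. It
does not import mgo: the refresher interface, the runAcknowledged helper,
and the fakeSession type are hypothetical names used only for illustration,
and refresher covers nothing but the Refresh method mentioned above. The
helper runs an operation once, calls Refresh if it fails, and still hands
the error back to the caller instead of retrying.

```go
package main

import (
	"errors"
	"fmt"
)

// refresher is a hypothetical interface holding the single method this
// sketch needs. A real mgo session has a Refresh method of this shape.
type refresher interface {
	Refresh()
}

// errTimeout simulates the kind of connection error discussed above.
var errTimeout = errors.New("i/o timeout")

// runAcknowledged runs op exactly once. If op fails, it calls Refresh so
// the session can be used again, then returns the error unchanged.
// It never retries op: whether a retry is safe is the caller's decision.
// Assumption: every error is treated as a connection error here; real code
// would probably want to refresh only on connection-level failures.
func runAcknowledged(s refresher, op func() error) error {
	err := op()
	if err != nil {
		s.Refresh()
	}
	return err
}

// fakeSession is a stand-in that only counts Refresh calls.
type fakeSession struct {
	refreshed int
}

func (f *fakeSession) Refresh() { f.refreshed++ }

func main() {
	s := &fakeSession{}
	err := runAcknowledged(s, func() error { return errTimeout })
	fmt.Println("error seen by caller:", err)
	fmt.Println("refresh calls:", s.refreshed)
}
```

The point of returning the error is that the caller still learns about the
failure and can decide whether the interrupted work is safe to redo.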

Still, as stated above, using a single session for _everything_
might not be a good idea for other reasons.
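
As a rough illustration of the alternative, the sketch below gives each
unit of work its own copied session and closes it afterwards. The session
interface, withSession helper, fakeSession and counters types are all
hypothetical stand-ins so the example runs without mgo or a database. In
mgo itself Copy returns a concrete *mgo.Session, so real code would use
that type rather than this interface.

```go
package main

import "fmt"

// session is a hypothetical stand-in for the two session methods used here.
type session interface {
	Copy() session
	Close()
}

// withSession runs fn against a fresh copy of root and always closes the
// copy afterwards, so no unit of work shares a session with another.
func withSession(root session, fn func(s session) error) error {
	s := root.Copy()
	defer s.Close()
	return fn(s)
}

// counters records how many copies were made and closed (demo only).
type counters struct {
	copies, closes int
}

// fakeSession is a stand-in that tracks Copy and Close calls.
type fakeSession struct {
	id    int
	stats *counters
}

func (f *fakeSession) Copy() session {
	f.stats.copies++
	return &fakeSession{id: f.stats.copies, stats: f.stats}
}

func (f *fakeSession) Close() { f.stats.closes++ }

func main() {
	stats := &counters{}
	root := &fakeSession{stats: stats}
	for i := 0; i < 2; i++ {
		_ = withSession(root, func(s session) error {
			fmt.Println("working on copy", s.(*fakeSession).id)
			return nil
		})
	}
	fmt.Println("copies:", stats.copies, "closes:", stats.closes)
}
```

Closing each copy in a defer keeps the number of open sessions tied to
the number of units of work currently in flight.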

(...)
> If session.Copy() doesn't work here, what's the approach to use to ensure the
> watcher just doesn't become dead because the underlying socket dies? Or how can
> we make the session.Copy() approach work always even when the host machine is
> under high load? Or maybe watcher code is fine and the tests are wrong?

This feels very much like a concurrency or timing issue. You might
also be misunderstanding what session.Copy does. It's not magic.
If session.Copy truly prevented the watcher from working, watchers
wouldn't work at all either way: every independent process that
connects to the database and makes a change is already monitored by
watchers that live in different sessions.

> The tests are quite simple:

I'm not able to observe the test failure you mention after hacking it
to use independent sessions:

http://paste.ubuntu.com/7852418/


gustavo @ http://niemeyer.net


