Session copying and i/o timeout bug
Michael Foord
michael.foord at canonical.com
Wed Jul 16 15:52:41 UTC 2014
Hey all,
I'm working on the "i/o timeout bug" [1]. We are assuming this is due to
us using a single global session for all communication with mongo. As
this bug is high importance I'm sharing my current status in case anyone
wants to help parallelise the work (see below).
[1] https://bugs.launchpad.net/juju-core/+bug/1307434
The right fix seems to be to copy sessions (deferring close) whenever we
talk to mongo. This uses the socket pooling built into mgo. Doing that
causes auth failures everywhere, because we change the mongo password on
starting jujud, so a copied session uses the wrong credentials. A fix for
this is to reopen the state after we change the password:
http://pastebin.ubuntu.com/7803855/
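
As a rough illustration of the copy-then-close pattern described above (this is not the code in the pastebin), here is a minimal sketch. The session type is a hypothetical stand-in that models only the Copy and Close methods of an mgo session, and withCopy is a made-up helper name. It also shows why the root has to be reopened after a password change: copies inherit whatever credentials the root they were copied from holds.

```go
package main

import "fmt"

// session is a hypothetical stand-in for *mgo.Session. It models only
// Copy and Close, plus the credentials a copy inherits from its parent.
type session struct {
	password string
	live     *int // number of open copies, shared with the parent
}

// Copy returns a new session that reuses the parent's credentials.
func (s *session) Copy() *session {
	*s.live++
	return &session{password: s.password, live: s.live}
}

// Close releases the copy (in mgo, its socket goes back to the pool).
func (s *session) Close() { *s.live-- }

// withCopy runs fn on a fresh copy of root and closes the copy
// afterwards, even if fn returns an error.
func withCopy(root *session, fn func(*session) error) error {
	s := root.Copy()
	defer s.Close()
	return fn(s)
}

func main() {
	live := 0
	root := &session{password: "old", live: &live}

	// Copies taken from the original root still carry the old password.
	withCopy(root, func(s *session) error {
		fmt.Println("copy uses password:", s.password)
		return nil
	})

	// After a password change the root is reopened, so later copies
	// pick up the new credentials.
	root = &session{password: "new", live: &live}
	withCopy(root, func(s *session) error {
		fmt.Println("copy uses password:", s.password)
		return nil
	})
	fmt.Println("open copies:", live)
}
```
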
The two "core" places I have started to look at changing our mongo use
to copy sessions are the watchers and the state transaction runners.
Both changes cause a great deal of test failures that need investigating
and fixing. This includes some auth failures (although fewer than
before). Presumably other places (test and production) also change the
mongo password.
The basic approach I'm taking for watchers is initially "quick and
dirty" to root out the problems. A proper abstraction over this is
needed. But it needs to work first.
Watchers run queries against collections. These have a reference to the
session. I've added a NewCollection function that copies the session and
returns a new collection, plus a closer function. The query can then be
run with the new session. The diff below changes two places in the
watcher code, and breaks a lot of things. Hopefully mostly due to the
same causes:
http://pastebin.ubuntu.com/7803950/
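
For readers without the diff to hand, a guess at the general shape of such a helper follows. The real NewCollection is in the pastebin above and may well differ. The session and collection types here are hypothetical stand-ins for the mgo types, and "units" is just an example collection name.

```go
package main

import "fmt"

// session and collection are hypothetical stand-ins for the mgo types;
// only the behaviour needed for the sketch is modelled.
type session struct{ closed bool }

func (s *session) Copy() *session { return &session{} }
func (s *session) Close()         { s.closed = true }

type collection struct {
	name    string
	session *session
}

// newCollection copies the session behind coll and returns an equivalent
// collection bound to the copy, plus a closer that releases the copy.
func newCollection(coll *collection) (*collection, func()) {
	s := coll.session.Copy()
	return &collection{name: coll.name, session: s}, s.Close
}

func main() {
	shared := &collection{name: "units", session: &session{}}

	coll, closer := newCollection(shared)
	defer closer()

	// A watcher would run its query against coll here.
	fmt.Println("querying", coll.name, "on its own session:", coll.session != shared.session)
}
```

Returning the closer alongside the collection keeps the copy and its cleanup together at the call site, so a defer can follow immediately.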
For transaction runners I'm just creating a new runner for each
transaction. They're cheap, but we're layering them with our own Runner
to provide test hooks. Again this change breaks things (and again - a
better abstraction for this is needed):
http://pastebin.ubuntu.com/7803890/
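
A sketch of what "a new runner for each transaction, layered under our own Runner" could look like is below. It is an assumption about the structure, not the code in the pastebin. All types are hypothetical stand-ins (txnRunner plays the role of the low-level mgo/txn runner, op the role of a transaction operation), and beforeHook is a made-up name for a test hook.

```go
package main

import (
	"errors"
	"fmt"
)

// session is a hypothetical stand-in for *mgo.Session.
type session struct{ closed bool }

func (s *session) Copy() *session { return &session{} }
func (s *session) Close()         { s.closed = true }

// op is a placeholder for a transaction operation.
type op struct{ collection, id string }

// txnRunner stands in for the cheap low-level runner that is created
// once per transaction.
type txnRunner struct{ session *session }

func (r *txnRunner) Run(ops []op) error {
	if r.session.closed {
		return errors.New("session closed")
	}
	return nil // a real runner would apply ops here
}

// runner is the outer layer: it calls an optional test hook, then builds
// a fresh txnRunner on a copied session for each transaction.
type runner struct {
	root       *session
	beforeHook func(ops []op) // test hook; nil in production
}

func (r *runner) Run(ops []op) error {
	if r.beforeHook != nil {
		r.beforeHook(ops)
	}
	s := r.root.Copy()
	defer s.Close()
	return (&txnRunner{session: s}).Run(ops)
}

func main() {
	r := &runner{
		root: &session{},
		beforeHook: func(ops []op) {
			fmt.Println("about to run", len(ops), "ops")
		},
	}
	err := r.Run([]op{{collection: "units", id: "wordpress/0"}})
	fmt.Println("err:", err)
}
```
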
I'm looking at the watcher breakages. If anyone wanted to pick up looking
at the transaction runner breakages then we could move faster. We'd need
to stay in sync as some of the root causes will be the same (especially
auth failures).
All the best,
Michael Foord