Session copying and i/o timeout bug

Michael Foord michael.foord at canonical.com
Wed Jul 16 15:52:41 UTC 2014


Hey all,

I'm working on the "i/o timeout bug" [1]. We are assuming this is due to 
us using a single global session for all communication with mongo. As 
this bug is high importance I'm sharing my current status in case anyone 
wants to help parallelise the work (see below):

     [1] https://bugs.launchpad.net/juju-core/+bug/1307434

The right fix seems to be to copy sessions (deferring close) whenever we 
talk to mongo. This uses the socket pooling built into mgo. Doing that 
causes auth failures everywhere, because we change the mongo password 
on starting jujud, so the copied session uses the wrong credentials. A 
fix for this is to reopen the state after we change the password:

     http://pastebin.ubuntu.com/7803855/
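
As a rough illustration of the copy-and-close pattern and of why the
password change breaks it, here is a toy model in Go. It does not use
mgo: server, session, dial and withCopy are hypothetical stand-ins, and
the assumption that a copy logs in again with the credentials of the
session it was copied from is a simplification, not a description of
the real driver.

```go
package main

import (
	"errors"
	"fmt"
)

// server is a toy stand-in for mongod: it only knows the current password.
type server struct{ password string }

// session is a toy stand-in for *mgo.Session (assumption: not the real API).
// It remembers the password it was opened with and logs in on first use.
type session struct {
	srv      *server
	password string
	authed   bool
	closed   bool
}

// dial opens a root session and logs in immediately.
func dial(srv *server, password string) (*session, error) {
	s := &session{srv: srv, password: password}
	return s, s.Ping()
}

// Copy returns a new, not yet logged in session that inherits the
// credentials of the original.
func (s *session) Copy() *session {
	return &session{srv: s.srv, password: s.password}
}

// Close releases the session; any later use is an error.
func (s *session) Close() { s.closed = true }

// Ping logs in if needed and reports whether the session is usable.
func (s *session) Ping() error {
	if s.closed {
		return errors.New("session closed")
	}
	if !s.authed {
		if s.password != s.srv.password {
			return errors.New("auth failed")
		}
		s.authed = true
	}
	return nil
}

// withCopy runs fn against a short-lived copy of root and always closes it.
func withCopy(root *session, fn func(*session) error) error {
	s := root.Copy()
	defer s.Close()
	return fn(s)
}

func main() {
	srv := &server{password: "old"}
	root, _ := dial(srv, "old")
	ping := func(s *session) error { return s.Ping() }

	fmt.Println("copy before change:", withCopy(root, ping))

	srv.password = "new" // models jujud changing the mongo password
	fmt.Println("root after change:", root.Ping())
	fmt.Println("copy after change:", withCopy(root, ping))

	// Models "reopen the state": dial again with the new password.
	root.Close()
	root, _ = dial(srv, "new")
	fmt.Println("copy after reopen:", withCopy(root, ping))
}
```

In this model the already logged in root session keeps working after
the password change, which is one way a single global session could
hide the problem, while every fresh copy fails until the root is
reopened with the new password.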

The two "core" places I have started to look at changing our mongo use 
to copy sessions are the watchers and the state transaction runners.  
Both changes cause a great deal of test failures that need investigating 
and fixing. This includes some auth failures (although fewer than 
before). Presumably other places (test and production) also change the 
mongo password.

The basic approach I'm taking for watchers is initially "quick and 
dirty" to root out the problems. A proper abstraction over this is 
needed. But it needs to work first.

Watchers run queries against collections. These have a reference to the 
session. I've added a NewCollection function that copies the session and 
returns a new collection, plus a closer function. The query can then be 
run with the new session. The diff below changes two places in the 
watcher code and breaks a lot of things, hopefully mostly due to the 
same causes:

     http://pastebin.ubuntu.com/7803950/
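
The shape of such a helper might look roughly like the sketch below.
It is not the code from the paste: session and collection are toy
stand-ins for the mgo types, and the newCollection signature and the
open counter are assumptions made for illustration.

```go
package main

import "fmt"

// session is a toy stand-in for *mgo.Session (assumption: not the real API).
// open counts the live sessions that share one pool.
type session struct {
	open   *int
	closed bool
}

// Copy returns a new session backed by the same pool.
func (s *session) Copy() *session {
	*s.open++
	return &session{open: s.open}
}

// Close releases the session exactly once.
func (s *session) Close() {
	if !s.closed {
		s.closed = true
		*s.open--
	}
}

// collection is a toy stand-in for *mgo.Collection: a name plus the
// session it is bound to.
type collection struct {
	name    string
	session *session
}

// Query stands in for running a real query; it only checks the session.
func (c *collection) Query() error {
	if c.session.closed {
		return fmt.Errorf("query on %q: session closed", c.name)
	}
	return nil
}

// newCollection (hypothetical signature) copies the session behind c and
// returns an equivalent collection bound to the copy, plus a closer that
// the caller must invoke once the query is done.
func newCollection(c *collection) (*collection, func()) {
	s := c.session.Copy()
	return &collection{name: c.name, session: s}, s.Close
}

func main() {
	open := 1
	root := &session{open: &open}
	units := &collection{name: "units", session: root}

	c, closer := newCollection(units)
	err := c.Query()
	closer()

	fmt.Println("query error:", err)
	fmt.Println("open sessions after close:", open)
}
```

Returning the closer next to the collection keeps the copy and its
cleanup together, so a caller can write `defer closer()` immediately
after the call.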

For transaction runners I'm just creating a new runner for each 
transaction. They're cheap, but we're layering them with our own Runner 
to provide test hooks. Again this change breaks things (and again - a 
better abstraction for this is needed):

     http://pastebin.ubuntu.com/7803890/
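
A minimal sketch of the runner-per-transaction idea, under the same
caveats: rawRunner stands in for the runner from mgo's txn package,
hookedRunner and its beforeHook field are hypothetical names for the
layer that provides test hooks, and operations are plain strings
rather than real transaction ops.

```go
package main

import "fmt"

// session is a toy stand-in for *mgo.Session (assumption: not the real API).
// copies counts how many copies were made from the root.
type session struct {
	copies *int
	closed bool
}

// Copy returns a new session and records that a copy was made.
func (s *session) Copy() *session {
	*s.copies++
	return &session{copies: s.copies}
}

// Close releases the session.
func (s *session) Close() { s.closed = true }

// rawRunner stands in for the underlying transaction runner.
type rawRunner struct{ s *session }

// Run pretends to apply ops; it only checks that the session is usable.
func (r *rawRunner) Run(ops []string) error {
	if r.s.closed {
		return fmt.Errorf("session closed")
	}
	return nil
}

// hookedRunner is a hypothetical wrapper: for every transaction it calls
// an optional test hook, copies the root session, builds a fresh
// rawRunner on the copy, and closes the copy when the transaction ends.
type hookedRunner struct {
	root       *session
	beforeHook func(ops []string)
}

// Run executes one transaction on its own session copy.
func (r *hookedRunner) Run(ops []string) error {
	if r.beforeHook != nil {
		r.beforeHook(ops)
	}
	s := r.root.Copy()
	defer s.Close()
	return (&rawRunner{s: s}).Run(ops)
}

func main() {
	copies := 0
	hooked := 0
	runner := &hookedRunner{
		root:       &session{copies: &copies},
		beforeHook: func(ops []string) { hooked++ },
	}

	err1 := runner.Run([]string{"insert unit"})
	err2 := runner.Run([]string{"remove unit"})

	fmt.Println("errors:", err1, err2)
	fmt.Println("session copies:", copies, "hook calls:", hooked)
}
```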

I'm looking at the watcher breakages. If anyone wants to pick up looking 
at the transaction runner breakages then we could move faster. We'd need 
to stay in sync as some of the root causes will be the same (especially 
auth failures).

All the best,

Michael Foord





