I'm concerned

Tim Penhey tim.penhey at canonical.com
Thu Jun 18 04:12:54 UTC 2015


OK, found it.  And it has nothing to do with leases.

I'm just proposing the fix now, but it has taken me most of the day to
diagnose and fix.

The certupdater worker was making the mistake of trusting a watcher. It
was blindly getting the addresses and updating the certificate. The
agent failed to stop when, for some reason, the address watcher fired
twice after the apiserver worker had shut down, but before the
certupdater worker was signalled to die (or before it noticed). The
certupdater worker communicates with the apiserver worker through a
buffered channel (with a one-item buffer).  With nothing left to receive,
the first notification filled the buffer, and it was the second
notification that triggered the blocking channel send.

I added a memory to the cert updater, so it doesn't blindly update the
cert, but only when the addresses do in fact change.
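
Roughly, the idea looks like the sketch below. This is an illustration
only, not the actual patch: the addressMemory type and its changed method
are hypothetical names, and treating the very first notification as a
change and ignoring address order are assumptions on my part.

```go
package main

import (
	"fmt"
	"sort"
)

// addressMemory (hypothetical) remembers the last address set acted on.
type addressMemory struct {
	last []string
	seen bool
}

// changed reports whether addrs differs from the remembered set, and
// remembers addrs when it does. Address order is ignored.
func (m *addressMemory) changed(addrs []string) bool {
	sorted := append([]string(nil), addrs...)
	sort.Strings(sorted)
	if m.seen && equalStrings(m.last, sorted) {
		return false
	}
	m.last, m.seen = sorted, true
	return true
}

// equalStrings reports whether a and b hold the same strings in order.
func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	var mem addressMemory
	// Three watcher notifications; the second repeats the first.
	events := [][]string{
		{"10.0.0.1"},
		{"10.0.0.1"},
		{"10.0.0.1", "10.0.0.2"},
	}
	for _, addrs := range events {
		if !mem.changed(addrs) {
			fmt.Println("skip", addrs) // no cert update, so no channel send
			continue
		}
		fmt.Println("update", addrs) // the real worker would update the cert here
	}
}
```

With this in place a repeated notification carrying the same addresses
never reaches the channel send at all.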

I had a failure rate of between 20 and 40% before this change, and it
appears to be fixed now.

Tim

On 17/06/15 22:01, William Reade wrote:
> ...but I think that axw actually addressed that already. Not sure then;
> don't really have the bandwidth to investigate deeply right now. Sorry
> for the noise.
> 
> On Wed, Jun 17, 2015 at 10:52 AM, William Reade
> <william.reade at canonical.com> wrote:
> 
>     I think the problem is in the implicit apiserver->leasemgr->state
>     dependencies; if the lease manager is stopped at the wrong moment,
>     the apiserver will never shut down because it's waiting on a blocked
>     leasemgr call. I'll propose something today.
> 
>     On Wed, Jun 17, 2015 at 7:33 AM, David Cheney
>     <david.cheney at canonical.com> wrote:
> 
>         This should be achievable. go test sends SIGQUIT on timeout, so we
>         can set up a SIGQUIT handler in the topmost suite (or import it as
>         a side-effect package), do whatever cleanup is needed, then os.Exit,
>         unhandle the signal and try to send SIGQUIT to ourselves, or just
>         panic.
> 
>         On Wed, Jun 17, 2015 at 3:25 PM, Tim Penhey
>         <tim.penhey at canonical.com> wrote:
>         > Hey team,
>         >
>         > I am getting more and more concerned about the length of time that
>         > master has been cursed.
>         >
>         > It seems that sometime recently we have introduced serious
>         > instability in cmd/jujud/agent, and it is often getting wedged
>         > and killed by the test timeout.
>         >
>         > I have spent some time looking, but I have not yet found a
>         > definitive cause.  At least some of the time the agent is
>         > failing to stop and is deadlocked.
>         >
>         > This is an intermittent failure, but intermittent enough that
>         > often at least one of the unit test runs fails with this
>         > problem, cursing the entire run.
>         >
>         > One thing I have considered to aid in the debugging is to add
>         > some code to the juju base suites somewhere (or in testing) that
>         > adds a goroutine that will dump the gocheck log just before the
>         > test gets killed due to timeout - perhaps a minute before. Not
>         > sure if we have access to the timeout or not, but we can at
>         > least make a sensible guess.
>         >
>         > This would give us at least some logging to work through in
>         > these situations where the test is getting killed due to
>         > running too long.
>         >
>         > If no one looks at this and fixes it overnight, I'll start
>         > poking it with a long stick tomorrow.
>         >
>         > Cheers,
>         > Tim
>         >
>         > --
>         > Juju-dev mailing list
>         > Juju-dev at lists.ubuntu.com <mailto:Juju-dev at lists.ubuntu.com>
>         > Modify settings or unsubscribe at:
>         https://lists.ubuntu.com/mailman/listinfo/juju-dev
> 



