I'm concerned

roger peppe roger.peppe at canonical.com
Thu Jun 18 14:06:15 UTC 2015


I'll also point out that there exists a package
that is suited to this style of thing
(I'm not sure whether it would be appropriate
for this particular case).

It makes it straightforward to create an object
that can be watched for changes and changed
without any rendezvous with the watchers.

http://godoc.org/github.com/juju/utils/voyeur

On 18 June 2015 at 14:08, William Reade <william.reade at canonical.com> wrote:
> Good catch.
>
> Everyone, please, remember: naked channel ops are basically bad and wrong
> and dangerous. You have to completely trust every other involved component
> to act exactly as you expect, and that's unjustifiably optimistic in all but
> the most tightly scoped scenarios.
>
> Cheers
> William
>
> On Thu, Jun 18, 2015 at 6:12 AM, Tim Penhey <tim.penhey at canonical.com>
> wrote:
>>
>> OK, found it.  And it has nothing to do with leases.
>>
>> I'm just proposing the fix now, but it has taken me most of the day to
>> diagnose and fix.
>>
>> The certupdater worker was making the mistake of trusting a watcher: it
>> was blindly getting the addresses and updating the certificate. The
>> cases where the agent failed to stop were when, for some reason, the
>> address watcher fired twice after the apiserver worker had shut down,
>> but before the certupdater worker was signalled to die (or before it
>> noticed). The certupdater worker communicates with the apiserver worker
>> through a buffered channel (with a one-item buffer), and it was the
>> second notification that triggered the blocking channel send.
>>
>> I added a memory to the cert updater, so it doesn't blindly update the
>> cert, but only when the addresses do in fact change.
>>
>> I had a failure rate of between 20 and 40% before this change, and it
>> appears to be fixed now.
>>
>> Tim
>>
>> On 17/06/15 22:01, William Reade wrote:
>> > ...but I think that axw actually addressed that already. Not sure then;
>> > I don't really have the bandwidth to investigate deeply right now. Sorry
>> > for the noise.
>> >
>> > On Wed, Jun 17, 2015 at 10:52 AM, William Reade
>> > <william.reade at canonical.com> wrote:
>> >
>> >     I think the problem is in the implicit apiserver->leasemgr->state
>> >     dependencies; if the lease manager is stopped at the wrong moment,
>> >     the apiserver will never shut down because it's waiting on a blocked
>> >     leasemgr call. I'll propose something today.
>> >
>> >     On Wed, Jun 17, 2015 at 7:33 AM, David Cheney
>> >     <david.cheney at canonical.com> wrote:
>> >
>> >         This should be achievable. go test sends SIGQUIT on timeout, so
>> >         we can set up a SIGQUIT handler in the topmost suite (or import
>> >         it as a side-effect package), do whatever cleanup is needed,
>> >         then os.Exit, unhandle the signal and try to send SIGQUIT to
>> >         ourselves, or just panic.
>> >
>> >         On Wed, Jun 17, 2015 at 3:25 PM, Tim Penhey
>> >         <tim.penhey at canonical.com> wrote:
>> >         > Hey team,
>> >         >
>> >         > I am getting more and more concerned about the length of
>> >         > time that master has been cursed.
>> >         >
>> >         > It seems that sometime recently we have introduced serious
>> >         > instability in cmd/jujud/agent, and it is often getting
>> >         > wedged and killed by the test timeout.
>> >         >
>> >         > I have spent some time looking, but I have not yet found a
>> >         > definitive cause. At least some of the time the agent is
>> >         > failing to stop and is deadlocked.
>> >         >
>> >         > This is an intermittent failure, but frequent enough that at
>> >         > least one of the unit test runs often fails with this
>> >         > problem, cursing the entire run.
>> >         >
>> >         > One thing I have considered to aid in the debugging is to
>> >         > add some code to the juju base suites somewhere (or in
>> >         > testing) that adds a goroutine to dump the gocheck log just
>> >         > before the test gets killed due to timeout - perhaps a
>> >         > minute before. I'm not sure whether we have access to the
>> >         > timeout, but we can at least make a sensible guess.
>> >         >
>> >         > This would give us at least some logging to work through in
>> >         > these situations where the test is killed for running too
>> >         > long.
>> >         >
>> >         > If no one looks at this and fixes it overnight, I'll start
>> >         > poking it with a long stick tomorrow.
>> >         >
>> >         > Cheers,
>> >         > Tim
>> >         >
>> >         > --
>> >         > Juju-dev mailing list
>> >         > Juju-dev at lists.ubuntu.com
>> >         > Modify settings or unsubscribe at:
>> >         https://lists.ubuntu.com/mailman/listinfo/juju-dev
>> >
>> >
>> >
>> >
>>
>
>


