HELP NEEDED: peergrouper tests very unreliable with golang 1.5

Mon Nov 16 01:33:51 UTC 2015

On Tue, Nov 10, 2015 at 12:18 PM Tim Penhey <tim.penhey at canonical.com>
wrote:

> Hi folks,
>
> After spending three hours on this and getting nowhere, I feel it is
> time for me to throw in the towel and let someone else who knows more
> about the peergrouper pick this up.
>
> One of the changes that golang 1.5 made was to change the default for
> GOMAXPROCS [1]. This causes new fun races in our tests.
>
> A key problem with the peergrouper is around assumptions that it knows
> that is going on :-)
>
> The peergrouper has a main loop that starts a watcher for state server
> changes.
>
> The state server changes watcher gets notified that there are some known
> state servers. This then signals to the primary peergrouper through a
> function channel. The peergrouper main loop then runs this function.
>
> The state server notify function then tries to start machine watchers
> for each of the machines that are currently state servers.
>
> Each of these machine watchers starts its own goroutine to watch for
> changes, and when it gets some, passes a refresh function to to the
> peergrouper main loop using the same notify function channel.
>
> Now there are a few races here:
>  1) when the peergrouper main loop gets a notify function, it resets the
> timer with 0, which should immediately signal the timer channel. It is
> possible that this timer channel will fire before the machine refresh
> notify functions are called. I attempted to fix this by changing the
> Reset to use time.Millisecond.
>  2) When the timer is signaled, it calls desiredPeerGroup. This however
> fails if it has machines with a nil Vote, in the extras (which are
> machines that aren't part of the current replicaset). This happens when
> the timer is signaled before the machine loops have updated themselves.
>
> I attempted to fix some of these in a branch off master [2], however the
> tests still fail in other places under load. Like here:
>
> worker_test.go:268:
>     c.Fatalf("timed out waiting for vote to be set")
> ... Error: timed out waiting for vote to be set
>
> But the underlying problem was:
>
> [LOG] 0:00.013 ERROR juju.worker.peergrouper peergrouper loop
> terminated: cannot compute desired peer group: voting non-machine member
> replicaset.Member{Id:1, Address:"[2001:DB8::11]:1234",
> Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil),
> Priority:(*float64)(nil),
> Tags:map[string]string{"juju-machine-id":"11"},
> SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)} found in peer group
> worker_test.go:271:
>
> Which comes from the desiredPeerGroup function when there is a nil Votes
> in an extra machine (which you can see is the case).
>
> However I don't grok why this is happening in the test. Full test output
> here:  http://paste.ubuntu.com/13213601/
>
> There are definitely races here, but for some of the failures, I can't
> tell if it is the peergrouper itself, or the complex mocks being used.
>
> So... I'm hoping someone who groks this more will pick up where I've
> left off.
>

See below for what not to do with timers.

I spent some time on Friday afternoon looking at this, and Tim has verified
that it fixes the issue for him. The tests were flaky because of how timers
were started/stopped/selected-on:

https://github.com/axw/juju/commit/4d1099ccd9b459e3da94c6595e2a7b5e945c3f87#diff-0fdda5d614954ebabbd3194ed9ef7f20L162

Cheers,
Andrew
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20151116/3feffc77/attachment.html>