HELP NEEDED: peergrouper tests very unreliable with golang 1.5
Andrew Wilkins
andrew.wilkins at canonical.com
Mon Nov 16 01:37:20 UTC 2015
On Mon, Nov 16, 2015 at 9:33 AM Andrew Wilkins <andrew.wilkins at canonical.com>
wrote:
> On Tue, Nov 10, 2015 at 12:18 PM Tim Penhey <tim.penhey at canonical.com>
> wrote:
>
>> Hi folks,
>>
>> After spending three hours on this and getting nowhere, I feel it is
>> time for me to throw in the towel and let someone else who knows more
>> about the peergrouper pick this up.
>>
>> One of the changes that golang 1.5 made was to change the default for
>> GOMAXPROCS [1]. This causes new fun races in our tests.
>>
>> A key problem with the peergrouper is around assumptions that it knows
>> that is going on :-)
>>
>> The peergrouper has a main loop that starts a watcher for state server
>> changes.
>>
>> The state server changes watcher gets notified that there are some known
>> state servers. This then signals to the primary peergrouper through a
>> function channel. The peergrouper main loop then runs this function.
>>
>> The state server notify function then tries to start machine watchers
>> for each of the machines that are currently state servers.
>>
>> Each of these machine watchers starts its own goroutine to watch for
>> changes, and when it gets some, passes a refresh function to to the
>> peergrouper main loop using the same notify function channel.
>>
>> Now there are a few races here:
>> 1) when the peergrouper main loop gets a notify function, it resets the
>> timer with 0, which should immediately signal the timer channel. It is
>> possible that this timer channel will fire before the machine refresh
>> notify functions are called. I attempted to fix this by changing the
>> Reset to use time.Millisecond.
>> 2) When the timer is signaled, it calls desiredPeerGroup. This however
>> fails if it has machines with a nil Vote, in the extras (which are
>> machines that aren't part of the current replicaset). This happens when
>> the timer is signaled before the machine loops have updated themselves.
>>
>> I attempted to fix some of these in a branch off master [2], however the
>> tests still fail in other places under load. Like here:
>>
>> worker_test.go:268:
>> c.Fatalf("timed out waiting for vote to be set")
>> ... Error: timed out waiting for vote to be set
>>
>> But the underlying problem was:
>>
>> [LOG] 0:00.013 ERROR juju.worker.peergrouper peergrouper loop
>> terminated: cannot compute desired peer group: voting non-machine member
>> replicaset.Member{Id:1, Address:"[2001:DB8::11]:1234",
>> Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil),
>> Priority:(*float64)(nil),
>> Tags:map[string]string{"juju-machine-id":"11"},
>> SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)} found in peer group
>> worker_test.go:271:
>>
>> Which comes from the desiredPeerGroup function when there is a nil Votes
>> in an extra machine (which you can see is the case).
>>
>> However I don't grok why this is happening in the test. Full test output
>> here: http://paste.ubuntu.com/13213601/
>>
>> There are definitely races here, but for some of the failures, I can't
>> tell if it is the peergrouper itself, or the complex mocks being used.
>>
>> So... I'm hoping someone who groks this more will pick up where I've
>> left off.
>>
>
> See below for what not to do with timers.
>
> I spent some time on Friday afternoon looking at this, and Tim has
> verified that it fixes the issue for him. The tests were flaky because of
> how timers were started/stopped/selected-on:
>
>
> https://github.com/axw/juju/commit/4d1099ccd9b459e3da94c6595e2a7b5e945c3f87#diff-0fdda5d614954ebabbd3194ed9ef7f20L162
>
Sorry, I should have actually explained rather than just pointed at code.
time.Timer.Stop will undo an already-triggered timer event. So if you do
this:
t := time.NewTimer(0)
t.Stop()
then the t.C channel may or may not have a value in it. The Stop() method
returns a boolean that tells you the answer to that question:
https://godoc.org/time#Timer.Stop.
My solution was just to do what we do elsewhere, and start out with a nil
channel and only set it to something when we want to wait on a time event.
Cheers,
> Andrew
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20151116/eb808466/attachment.html>
More information about the Juju-dev
mailing list