HELP NEEDED: peergrouper tests very unreliable with golang 1.5

roger peppe rogpeppe at gmail.com
Mon Nov 16 09:27:33 UTC 2015


See also https://github.com/golang/go/issues/11513


On 16 November 2015 at 01:37, Andrew Wilkins
<andrew.wilkins at canonical.com> wrote:
> On Mon, Nov 16, 2015 at 9:33 AM Andrew Wilkins
> <andrew.wilkins at canonical.com> wrote:
>>
>> On Tue, Nov 10, 2015 at 12:18 PM Tim Penhey <tim.penhey at canonical.com>
>> wrote:
>>>
>>> Hi folks,
>>>
>>> After spending three hours on this and getting nowhere, I feel it is
>>> time for me to throw in the towel and let someone else who knows more
>>> about the peergrouper pick this up.
>>>
>>> One of the changes that golang 1.5 made was to change the default for
>>> GOMAXPROCS [1]. This causes new fun races in our tests.
>>>
>>> A key problem with the peergrouper is around assumptions that it knows
>>> that is going on :-)
>>>
>>> The peergrouper has a main loop that starts a watcher for state server
>>> changes.
>>>
>>> The state server changes watcher gets notified that there are some known
>>> state servers. This then signals to the primary peergrouper through a
>>> function channel. The peergrouper main loop then runs this function.
>>>
>>> The state server notify function then tries to start machine watchers
>>> for each of the machines that are currently state servers.
>>>
>>> Each of these machine watchers starts its own goroutine to watch for
>>> changes, and when it gets some, passes a refresh function to to the
>>> peergrouper main loop using the same notify function channel.
>>>
>>> Now there are a few races here:
>>>  1) when the peergrouper main loop gets a notify function, it resets the
>>> timer with 0, which should immediately signal the timer channel. It is
>>> possible that this timer channel will fire before the machine refresh
>>> notify functions are called. I attempted to fix this by changing the
>>> Reset to use time.Millisecond.
>>>  2) When the timer is signaled, it calls desiredPeerGroup. This however
>>> fails if it has machines with a nil Vote, in the extras (which are
>>> machines that aren't part of the current replicaset). This happens when
>>> the timer is signaled before the machine loops have updated themselves.
>>>
>>> I attempted to fix some of these in a branch off master [2], however the
>>> tests still fail in other places under load. Like here:
>>>
>>> worker_test.go:268:
>>>     c.Fatalf("timed out waiting for vote to be set")
>>> ... Error: timed out waiting for vote to be set
>>>
>>> But the underlying problem was:
>>>
>>> [LOG] 0:00.013 ERROR juju.worker.peergrouper peergrouper loop
>>> terminated: cannot compute desired peer group: voting non-machine member
>>> replicaset.Member{Id:1, Address:"[2001:DB8::11]:1234",
>>> Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil),
>>> Priority:(*float64)(nil),
>>> Tags:map[string]string{"juju-machine-id":"11"},
>>> SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)} found in peer group
>>> worker_test.go:271:
>>>
>>> Which comes from the desiredPeerGroup function when there is a nil Votes
>>> in an extra machine (which you can see is the case).
>>>
>>> However I don't grok why this is happening in the test. Full test output
>>> here:  http://paste.ubuntu.com/13213601/
>>>
>>> There are definitely races here, but for some of the failures, I can't
>>> tell if it is the peergrouper itself, or the complex mocks being used.
>>>
>>> So... I'm hoping someone who groks this more will pick up where I've
>>> left off.
>>
>>
>> See below for what not to do with timers.
>>
>> I spent some time on Friday afternoon looking at this, and Tim has
>> verified that it fixes the issue for him. The tests were flaky because of
>> how timers were started/stopped/selected-on:
>>
>>
>> https://github.com/axw/juju/commit/4d1099ccd9b459e3da94c6595e2a7b5e945c3f87#diff-0fdda5d614954ebabbd3194ed9ef7f20L162
>
>
> Sorry, I should have actually explained rather than just pointed at code.
>
> time.Timer.Stop will undo an already-triggered timer event. So if you do
> this:
>     t := time.NewTimer(0)
>     t.Stop()
> then the t.C channel may or may not have a value in it. The Stop() method
> returns a boolean that tells you the answer to that question:
> https://godoc.org/time#Timer.Stop.
>
> My solution was just to do what we do elsewhere, and start out with a nil
> channel and only set it to something when we want to wait on a time event.
>
>> Cheers,
>> Andrew
>
>
> --
> Juju-dev mailing list
> Juju-dev at lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>



More information about the Juju-dev mailing list