HELP NEEDED: peergrouper tests very unreliable with golang 1.5
Tim Penhey
tim.penhey at canonical.com
Tue Nov 10 04:18:39 UTC 2015
Hi folks,
After spending three hours on this and getting nowhere, I feel it is
time for me to throw in the towel and let someone else who knows more
about the peergrouper pick this up.
One of the changes that golang 1.5 made was to change the default for
GOMAXPROCS [1]. This causes new fun races in our tests.
A key problem with the peergrouper is around assumptions that it knows
that is going on :-)
The peergrouper has a main loop that starts a watcher for state server
changes.
The state server changes watcher gets notified that there are some known
state servers. This then signals to the primary peergrouper through a
function channel. The peergrouper main loop then runs this function.
The state server notify function then tries to start machine watchers
for each of the machines that are currently state servers.
Each of these machine watchers starts its own goroutine to watch for
changes, and when it gets some, passes a refresh function to to the
peergrouper main loop using the same notify function channel.
Now there are a few races here:
1) when the peergrouper main loop gets a notify function, it resets the
timer with 0, which should immediately signal the timer channel. It is
possible that this timer channel will fire before the machine refresh
notify functions are called. I attempted to fix this by changing the
Reset to use time.Millisecond.
2) When the timer is signaled, it calls desiredPeerGroup. This however
fails if it has machines with a nil Vote, in the extras (which are
machines that aren't part of the current replicaset). This happens when
the timer is signaled before the machine loops have updated themselves.
I attempted to fix some of these in a branch off master [2], however the
tests still fail in other places under load. Like here:
worker_test.go:268:
c.Fatalf("timed out waiting for vote to be set")
... Error: timed out waiting for vote to be set
But the underlying problem was:
[LOG] 0:00.013 ERROR juju.worker.peergrouper peergrouper loop
terminated: cannot compute desired peer group: voting non-machine member
replicaset.Member{Id:1, Address:"[2001:DB8::11]:1234",
Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil),
Priority:(*float64)(nil),
Tags:map[string]string{"juju-machine-id":"11"},
SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)} found in peer group
worker_test.go:271:
Which comes from the desiredPeerGroup function when there is a nil Votes
in an extra machine (which you can see is the case).
However I don't grok why this is happening in the test. Full test output
here: http://paste.ubuntu.com/13213601/
There are definitely races here, but for some of the failures, I can't
tell if it is the peergrouper itself, or the complex mocks being used.
So... I'm hoping someone who groks this more will pick up where I've
left off.
Cheers,
Tim
[1]
https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3g5mATSXqqV5QrxinasI/edit
[2] https://github.com/howbazaar/juju/tree/peergrouper-test-fix
More information about the Juju-dev
mailing list