Investigating transient failure
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 9 06:57:31 UTC 2013
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
We just had go-bot bounce a patch because of a transient failure:
https://code.launchpad.net/~ack/juju-core/uuid-in-environment-info/+merge/173485/comments/388115
I really want our test suite to be reliable for the bot, having to
resubmit due to transient is pretty bad. As transient failures tend to
build up over time, and you end up with what happened with Mongo (test
suite "transiently" failed 3 out of 4 times.)
There were actually 2 failures in that, but I think they are different
causes.
The first failure I'm pretty concerned about, because it seems likely
to reoccur:
FAIL: machine_test.go:571: MachineSuite.TestWatchMachine
...
machine_test.go:607:
testing.NewNotifyWatcherC(c, s.State, w).AssertOneChange()
testing/watcher.go:65:
c.Fatalf("watcher sent unexpected change: (_, %v)", ok)
... Error: watcher sent unexpected change: (_, true)
Note that the failure is that we *got* an event when we didn't expect
one. The timeout on that function is shortTime == 50ms.
Given the traceback, it would appear that we call AssertOneChange, but
we ended up getting 2 changes. (I'm slightly concerned that
AssertOneChange isn't showing up in the failure traceback.)
- From what I can see, this is some sort of failure-to-coalesce 2
changes into 1 event. (EnsureDead + Remove is supposed to only trigger
1 watch event.)
Does Remove happen asynchronously, so it might not have finished
triggering by the time we start watching the Machine?
We do have the "initial event" which always triggers, so really we
have 3 events going on, and we assume that we'll only see that as 1
actual trigger.
Remove is happening in a Transaction, but my understanding of Mongo is
that may not actually be applied to the DB by the time the function
returns? In which case the Remove would show up after we've started
watching.
Is there something we can do to prevent the race?
The second failure is a "timed out" failure in
worker/uniter/filter_test.go
We waited 50ms to get a ConfigEvents() change. The things that strike
me here are:
a) We should be using state/testing/NotifyWatcherC, or its equivalent,
rather than inline functions that are doing essentially the same
thing. I realize this is slightly special case because it isn't
actually a Watcher.
ConfigEvents *is* a "chan struct{}" though so we could probably get
some sort of shared helper.
b) The NotifyWatcherC code uses a timeout of 500ms for "things which
we expect to happen immediately". Versus 50ms for "things we expect to
timeout".
So we can probably just bump up the timeout on that particular code,
though as mentioned I'd really like to see helpers here.
Thoughts?
I'll try to put a patch together that at least fixes up the
filter_test.go code. I don't feel like I fully understand the
machine_test.go failure.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (Cygwin)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
iEYEARECAAYFAlHbtFsACgkQJdeBCYSNAAM9EwCeLHuSZotfgkHvaSOskuA7NLSJ
TPwAnjIzxEXlV1uct5G+tRY8GIQ9XIDc
=u9Ub
-----END PGP SIGNATURE-----
More information about the Juju-dev
mailing list