Scale testing analysis
Tim Penhey
tim.penhey at canonical.com
Tue May 23 03:52:59 UTC 2017
Hi folks,
We had another scale test today to analyse why the controller CPU usage
didn't fall away as expected when the models were removed.
I'll be filing a bunch of bugs from the analysis process, but there is
one bug that is, I believe, the culprit for the high CPU usage.
Interestingly enough, Juju developers were not able to reproduce the
problem with smaller deployments. The scale we were testing was 140
models, each with 10 machines and about 20 units in total.
During the teardown process of the testing, all models were destroyed at
once.
Most of the responsive nature of Juju is driven off the watchers.
These watchers watch the mongo oplog for document changes. What
happened was that there were so many mongo operations that the capped
collection backing the oplog was completely overwritten between
watcher polls. The watchers then errored out in a new, unexpected way.
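To make the failure mode concrete, here is a minimal sketch of a
polled reader over a capped log. It is hypothetical (the oplogEntry
type and integer timestamps are simplified stand-ins, not the actual
juju/state watcher code); the point is the wrap-around check:

package main

import (
	"errors"
	"fmt"
)

// oplogEntry stands in for a document in mongo's capped oplog
// collection; ts is a monotonically increasing logical timestamp.
type oplogEntry struct {
	ts int64
	op string
}

// errOplogOverwritten signals that the capped collection rolled past
// the position we last polled from, so changes were lost in the gap.
var errOplogOverwritten = errors.New("oplog overwritten since last poll")

// pollSince returns entries newer than lastSeen, or errOplogOverwritten
// if the oldest surviving entry is already beyond lastSeen+1, meaning
// the capped collection wrapped around between two polls.
func pollSince(oplog []oplogEntry, lastSeen int64) ([]oplogEntry, error) {
	if len(oplog) > 0 && oplog[0].ts > lastSeen+1 {
		// We cannot know what changed in the gap; incremental updates
		// are no longer trustworthy and a full reset is needed.
		return nil, errOplogOverwritten
	}
	var fresh []oplogEntry
	for _, e := range oplog {
		if e.ts > lastSeen {
			fresh = append(fresh, e)
		}
	}
	return fresh, nil
}

func main() {
	// Simulate a capped collection that rolled over: entries 1-5 are gone.
	oplog := []oplogEntry{{ts: 6, op: "u"}, {ts: 7, op: "i"}}
	if _, err := pollSince(oplog, 2); err != nil {
		fmt.Println("watcher backlog lost:", err)
	}
}

In the simulated run, a poller that last saw ts 2 finds that the
oldest surviving entry is ts 6, so it has no way to catch up
incrementally and has to report the loss instead of carrying on with
stale state.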
Effectively, the watcher infrastructure needs an internal reset
button it can hit when this happens, one that invalidates all the
watchers. That should cause all the workers to be torn down and
restarted from a known good state.
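A rough sketch of what such a reset could look like, assuming a
hypothetical watcherHub type (the real watcher plumbing in juju/state
is more involved than this):

package main

import (
	"fmt"
	"sync"
)

// watcherHub is a sketch of that reset button: every watcher registers
// a channel, and Reset closes them all so the workers depending on
// them shut down and can be restarted from a known good state.
type watcherHub struct {
	mu       sync.Mutex
	nextID   int
	watchers map[int]chan struct{}
}

func newWatcherHub() *watcherHub {
	return &watcherHub{watchers: make(map[int]chan struct{})}
}

// Register returns a channel that is closed when the hub resets, plus
// an unregister func for normal, individual shutdown.
func (h *watcherHub) Register() (invalid <-chan struct{}, unregister func()) {
	h.mu.Lock()
	defer h.mu.Unlock()
	id := h.nextID
	h.nextID++
	ch := make(chan struct{})
	h.watchers[id] = ch
	return ch, func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		delete(h.watchers, id)
	}
}

// Reset invalidates every registered watcher at once, e.g. when the
// oplog has been overwritten and incremental updates cannot be trusted.
func (h *watcherHub) Reset() {
	h.mu.Lock()
	defer h.mu.Unlock()
	for id, ch := range h.watchers {
		close(ch)
		delete(h.watchers, id)
	}
}

func main() {
	hub := newWatcherHub()
	invalid, _ := hub.Register()

	hub.Reset() // the overwritten-oplog condition has been detected

	<-invalid // a worker selecting on this channel would now tear down
	fmt.Println("watcher invalidated; restart from a fresh full sync")
}

The point of the design is that workers never poll for validity; they
just select on the channel they were given, so a single Reset fans
out to every watcher at once.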
One model got stuck being destroyed; this was tracked back to the
worker that should have been doing the destruction not noticing.
All the CPU usage can be traced back to the 139 models still held in
the apiserver state pools, each still running leadership and base
watcher workers. The state pool should have removed these instances,
but it didn't notice the models were gone.
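For illustration only, here is a hypothetical, reference-counted pool
sketch showing the behaviour we actually want: once a model is marked
removed and the last user releases it, its workers stop and the entry
is evicted. Locking is omitted for brevity, and the real apiserver
state pool looks different:

package main

import "fmt"

// modelState stands in for the per-model state object plus its
// background workers (leadership manager, base watcher, and so on).
type modelState struct {
	uuid string
}

// Close would stop the model's workers; here it just reports.
func (s *modelState) Close() {
	fmt.Println("stopped workers for model", s.uuid)
}

type poolEntry struct {
	state   *modelState
	refs    int
	removed bool // model destroyed; evict once refs drop to zero
}

// statePool is keyed by model UUID and reference counted, so entries
// for removed models are evicted and their workers actually stopped.
type statePool struct {
	entries map[string]*poolEntry
}

func newStatePool() *statePool {
	return &statePool{entries: make(map[string]*poolEntry)}
}

// Get hands out the model's state, creating it on first use.
func (p *statePool) Get(uuid string) *modelState {
	e, ok := p.entries[uuid]
	if !ok {
		e = &poolEntry{state: &modelState{uuid: uuid}}
		p.entries[uuid] = e
	}
	e.refs++
	return e.state
}

// Release drops a reference; a removed model with no remaining
// references is evicted and its workers stopped.
func (p *statePool) Release(uuid string) {
	e, ok := p.entries[uuid]
	if !ok {
		return
	}
	e.refs--
	if e.removed && e.refs <= 0 {
		e.state.Close()
		delete(p.entries, uuid)
	}
}

// MarkRemoved records that the model is gone, so the pool knows to
// evict the entry instead of keeping its watchers running forever.
func (p *statePool) MarkRemoved(uuid string) {
	e, ok := p.entries[uuid]
	if !ok {
		return
	}
	e.removed = true
	if e.refs <= 0 {
		e.state.Close()
		delete(p.entries, uuid)
	}
}

func main() {
	pool := newStatePool()
	_ = pool.Get("model-a")     // an apiserver connection uses the model
	pool.MarkRemoved("model-a") // the model is destroyed
	pool.Release("model-a")     // last user lets go; workers stop here
}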
There are some other bugs around logging things as errors that aren't
really errors, which contributed to log noise, but the fundamental
problem here is not being robust in the face of too much change at
once.
This needs to be fixed for the 2.2 release candidate, so it may well
push that out past the end of this week.
Tim