Scale testing analysis
Tim Penhey
tim.penhey at canonical.com
Tue May 23 03:52:59 UTC 2017
Hi folks,
We had another scale test today to analyse why the controller CPU usage
didn't fall away as expected when the models were removed.
I'll be filing a bunch of bugs from the analysis process, but there is
one bug that is, I believe, the culprit for the high CPU usage.
Interestingly enough, Juju developers were not able to reproduce the
problem with smaller deployments. The scale we were testing was 140
models, each with 10 machines and about 20 units in total.
During the teardown process of the testing, all models were destroyed at
once.
Most of the responsive nature of Juju is driven off the watchers.
These watchers watch the mongo oplog for document changes. What
happened was that there were so many mongo operations that the capped
collection backing the oplog was completely overwritten between
watcher polls. The watchers then errored out in a new, unexpected way.
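To make the failure mode concrete, here is a minimal sketch of a
polled reader over a capped log. It is hypothetical (the oplogEntry
type and integer timestamps are simplified stand-ins, not the actual
juju/state watcher code); the point is the wrap-around check:

package main

import (
	"errors"
	"fmt"
)

// oplogEntry stands in for a document in mongo's capped oplog
// collection; ts is a monotonically increasing logical timestamp.
type oplogEntry struct {
	ts int64
	op string
}

// errOplogOverwritten signals that the capped collection rolled past
// the position we last polled from, so changes were lost in the gap.
var errOplogOverwritten = errors.New("oplog overwritten since last poll")

// pollSince returns entries newer than lastSeen, or errOplogOverwritten
// if the oldest surviving entry is already beyond lastSeen+1, meaning
// the capped collection wrapped around between two polls.
func pollSince(oplog []oplogEntry, lastSeen int64) ([]oplogEntry, error) {
	if len(oplog) > 0 && oplog[0].ts > lastSeen+1 {
		// We cannot know what changed in the gap; incremental updates
		// are no longer trustworthy and a full reset is needed.
		return nil, errOplogOverwritten
	}
	var fresh []oplogEntry
	for _, e := range oplog {
		if e.ts > lastSeen {
			fresh = append(fresh, e)
		}
	}
	return fresh, nil
}

func main() {
	// Simulate a capped collection that rolled over: entries 1-5 are gone.
	oplog := []oplogEntry{{ts: 6, op: "u"}, {ts: 7, op: "i"}}
	if _, err := pollSince(oplog, 2); err != nil {
		fmt.Println("watcher backlog lost:", err)
	}
}

In the simulated run, a poller that last saw ts 2 finds that the
oldest surviving entry is ts 6, so it has no way to catch up
incrementally and has to report the loss instead of carrying on with
stale state.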
Effectively, the watcher infrastructure needs an internal reset
button it can hit when this happens, one that invalidates all the
watchers. That should cause all the workers to be torn down and
restarted from a known good state.
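A rough sketch of what such a reset could look like, assuming a
hypothetical watcherHub type (the real watcher plumbing in juju/state
is more involved than this):

package main

import (
	"fmt"
	"sync"
)

// watcherHub is a sketch of that reset button: every watcher registers
// a channel, and Reset closes them all so the workers depending on
// them shut down and can be restarted from a known good state.
type watcherHub struct {
	mu       sync.Mutex
	nextID   int
	watchers map[int]chan struct{}
}

func newWatcherHub() *watcherHub {
	return &watcherHub{watchers: make(map[int]chan struct{})}
}

// Register returns a channel that is closed when the hub resets, plus
// an unregister func for normal, individual shutdown.
func (h *watcherHub) Register() (invalid <-chan struct{}, unregister func()) {
	h.mu.Lock()
	defer h.mu.Unlock()
	id := h.nextID
	h.nextID++
	ch := make(chan struct{})
	h.watchers[id] = ch
	return ch, func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		delete(h.watchers, id)
	}
}

// Reset invalidates every registered watcher at once, e.g. when the
// oplog has been overwritten and incremental updates cannot be trusted.
func (h *watcherHub) Reset() {
	h.mu.Lock()
	defer h.mu.Unlock()
	for id, ch := range h.watchers {
		close(ch)
		delete(h.watchers, id)
	}
}

func main() {
	hub := newWatcherHub()
	invalid, _ := hub.Register()

	hub.Reset() // the overwritten-oplog condition has been detected

	<-invalid // a worker selecting on this channel would now tear down
	fmt.Println("watcher invalidated; restart from a fresh full sync")
}

The point of the design is that workers never poll for validity; they
just select on the channel they were given, so a single Reset fans
out to every watcher at once.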
One model got stuck being destroyed; this was tracked back to the
worker that should have been doing the destruction not noticing.
All the CPU usage can be traced back to the 139 models still held in
the apiserver state pools, each still running leadership and base
watcher workers. The state pool should have removed these instances,
but it didn't notice the models were gone.
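For illustration only, here is a hypothetical, reference-counted pool
sketch showing the behaviour we actually want: once a model is marked
removed and the last user releases it, its workers stop and the entry
is evicted. Locking is omitted for brevity, and the real apiserver
state pool looks different:

package main

import "fmt"

// modelState stands in for the per-model state object plus its
// background workers (leadership manager, base watcher, and so on).
type modelState struct {
	uuid string
}

// Close would stop the model's workers; here it just reports.
func (s *modelState) Close() {
	fmt.Println("stopped workers for model", s.uuid)
}

type poolEntry struct {
	state   *modelState
	refs    int
	removed bool // model destroyed; evict once refs drop to zero
}

// statePool is keyed by model UUID and reference counted, so entries
// for removed models are evicted and their workers actually stopped.
type statePool struct {
	entries map[string]*poolEntry
}

func newStatePool() *statePool {
	return &statePool{entries: make(map[string]*poolEntry)}
}

// Get hands out the model's state, creating it on first use.
func (p *statePool) Get(uuid string) *modelState {
	e, ok := p.entries[uuid]
	if !ok {
		e = &poolEntry{state: &modelState{uuid: uuid}}
		p.entries[uuid] = e
	}
	e.refs++
	return e.state
}

// Release drops a reference; a removed model with no remaining
// references is evicted and its workers stopped.
func (p *statePool) Release(uuid string) {
	e, ok := p.entries[uuid]
	if !ok {
		return
	}
	e.refs--
	if e.removed && e.refs <= 0 {
		e.state.Close()
		delete(p.entries, uuid)
	}
}

// MarkRemoved records that the model is gone, so the pool knows to
// evict the entry instead of keeping its watchers running forever.
func (p *statePool) MarkRemoved(uuid string) {
	e, ok := p.entries[uuid]
	if !ok {
		return
	}
	e.removed = true
	if e.refs <= 0 {
		e.state.Close()
		delete(p.entries, uuid)
	}
}

func main() {
	pool := newStatePool()
	_ = pool.Get("model-a")     // an apiserver connection uses the model
	pool.MarkRemoved("model-a") // the model is destroyed
	pool.Release("model-a")     // last user lets go; workers stop here
}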
There are some other bugs around logging things as errors that aren't
really errors, which contributed to log noise, but the fundamental
problem here is not being robust in the face of too much change at
once.
This needs to be fixed for the 2.2 release candidate, so it may well
push that out past the end of this week.
Tim