Scale testing analysis

Tue May 23 05:01:08 UTC 2017

>
> ...
>

> Most of the responsive nature of Juju is driven off the watchers.
> These watchers watch the mongo oplog for document changes. What happened
> was that there were so many mongo operations, the capped collection of the
> oplog was completely replaced between our polled watcher delays. The
> watchers then errored out in a new unexpected way.
>
> Effectively the watcher infrastructure needs an internal reset button that
> it can hit when this happens that invalidates all the watchers. This should
> cause all the workers to be torn down and restarted from a known good state.
>

Tim and I discussed this a bit. It probably wasn't the 'oplog' that
overflowed, but the 'txns.log' collection, which is also a capped
collection, 10MB in size.
The issue is likely that the 'txnsLogWorker' automatically restarted on an
error, but the error actually meant that we had missed events, which means
that all the watchers/workers that rely on the event stream should
be restarted. (We obviously can't know which events we're missing, because
they're missing.)

So one argument is that txnsLogWorker should *not* be automatically
restarted. Instead, failures of that worker should be treated as critical
failures and just cause the whole process to restart.
The alternative is that we introduce a mechanism to cause all workers to
restart (since they need to start fresh anyway), but restarting the agent
has a similar effect.

It is possible that we could whitelist some known errors that don't
indicate we need a full restart, but it really should be a whitelist:
anything we don't recognize should still trigger the full restart.

John
=:->

>
> There was a model that got stuck being destroyed; this was tracked back to
> a worker that should have been doing the destruction but did not notice.
>
> All the CPU usage can be tracked back to the 139 models in the apiserver
> state pools each still running leadership and base watcher workers. The
> state pool should have removed all these instances, but it didn't notice
> they were gone.
>
> There are some other bugs around logging things as errors that really
> aren't errors that contributed to log noise, but the fundamental error here
> is not being robust in the face of too much change at once.
>
> This needs to be fixed for the 2.2 release candidate, so it may well push
> that out past the end of this week.
>
> Tim