Notes from Scale testing
Kapil Thangavelu
kapil.thangavelu at canonical.com
Wed Oct 30 14:18:30 UTC 2013
Hi John,
This is awesome, it's great to see this scale testing and analysis. Some
additional questions/comments inline.
On Wed, Oct 30, 2013 at 6:23 AM, John Arbash Meinel
<john at arbash-meinel.com>wrote:
>
> I'm trying to put together a quick summary of what I've found out so
> far with testing juju in an environment with thousands (5000+) agents.
>
>
> 1) I didn't ever run into problems with connection failures due to
> socket exhaustion. The default upstart script we write for jujud has
> "limit nofile 20000 20000" and we seem to properly handle that 1 agent
> == 1 connection. (vs the old 1 agent = >=2 mongodb connections).
>
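For anyone repeating this, a rough sanity check on the state server (assuming
jujud runs as root and is the only jujud process there) is to compare the
applied limit against the live fd count:

    sudo grep -i 'open files' /proc/$(pgrep -o jujud)/limits
    sudo ls /proc/$(pgrep -o jujud)/fd | wc -l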
>
> 2) Agents seem to consume about 17MB resident according to 'top'. That
> should mean we can run ~450 agents on an m1.large. Though in my
> testing I was running ~450 and still had free memory, so I'm guessing
> there might be some copy-on-write pages (17MB is very close to the
> size of the jujud binary).
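For reference, that ~450 matches the back-of-the-envelope math (an m1.large
has 7.5 GiB of RAM):

    echo $(( 7680 / 17 ))   # ~451 agents at 17MB each, before any shared pages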
>
> 3) On the API server, with 5k active connections resident memory was
> 2.2G for jujud (about 400kB/conn), and only about 55MB for mongodb. DB
> size on disk was about 650MB.
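A rough way to watch those numbers on the state server (assuming the default
api port of 17070 and that jujud/mongod are the only daemons of interest):

    netstat -tn | grep -c ':17070 .*ESTABLISHED'   # active agent connections
    ps -C jujud -o rss=,comm=                      # resident memory in kB
    ps -C mongod -o rss=,comm=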
>
> The log file could grow pretty big (up to 2.5GB once everything was up
> and running, though it does compress to 200MB), but I'll come back to
> that later.
>
The log size didn't come up again later in this mail. I'm not sure if you
meant to cover it separately or if it just got lost in the length of the
message.
>
> Once all the agents are up and running, they actually are very quiet
> (almost 0 log statements).
>
>
> 4) If I bring up the units one by one (for i in `seq 500`; do for j in
> `seq 10`; do juju add-unit --to $j & done; time wait; done), it ends up
> triggering O(N^2) behavior in the system. Each unit agent seems to
> have a watcher for other units of the same service. So when you add 1
> unit, it wakes up all existing units to let them know about it. In
> theory this is on a 5s rate limit (only 1 wakeup per 5 seconds). In
> practice it was taking >3s per add unit call [even when requesting
> them in parallel]. I think this was because of the load on the API
> server of all the other units waking up and asking for details at the
> same time.
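For anyone wanting to reproduce the timing side of this, a sketch of that
bring-up loop with per-round timings (assuming the service is called ubuntu
and machines 1-10 host the units, as in the deploy sequence later in this
mail):

    for i in $(seq 500); do
        start=$(date +%s)
        for j in $(seq 10); do juju add-unit ubuntu --to $j & done
        wait
        echo "round $i: $(( $(date +%s) - start ))s" >> add-unit-times.log
    done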
>
>
> From what I can tell, all units take out a watch on their service so
> that they can monitor its Life and CharmURL. However, adding a unit to
> a service triggers a change on that service, even though Life and
> CharmURL haven't changed. If we split out Watching the
> units-on-a-service from the lifetime and URL of a service, we could
> avoid the thundering N^2 herd problem while starting up a bunch of
> units. Though UpgradeCharm is still going to cause a thundering herd.
>
> Response in log from last "AddServiceUnits" call:
> http://paste.ubuntu.com/6329753/
>
> Essentially it triggers 700 calls to Service.Life and CharmURL (I
> think at this point one of the 10 machines wasn't responding, so it
> was <1k Units running)
>
>
> 5) Along with load, we weren't caching the IP address of the API
> machine, which caused us to read the provider-state file from object
> storage and then ask EC2 for the IP address of that machine.
> Log of 1 unit agent's connection: http://paste.ubuntu.com/6329661/
Just to be clear for other readers (it wasn't clear to me without checking
the src): this isn't each agent resolving the api server address from
provider-state, which would mean provider credentials being available to
each agent; rather, each agent periodically requests the addresses of the
api servers via the api. So the cache here is on the api server.
>
> Eventually while starting up the Unit agent would make a request for
> APIAddresses (I believe it puts that information into the context for
> hooks that it runs). Occasionally that request gets rate limited by EC2.
> When that request fails it triggers us to stop the
> "WatchServiceRelations"
> "WatchConfigSettings"
> "Watch(unit-ubuntu-4073)" # itself
> "Watch(service-ubuntu)" # the service it is running
>
> It then seems to restart the Unit agent, which goes through the steps
> of making all the same requests again. (Get the Life of my Unit, get
> the Life of my service, get the UUID of this environment, etc., there
> are 41 requests before it gets to APIAddress)
>
>
> 6) If you restart jujud (say after an upgrade) it causes all unit
> agents to restart the 41 requests for startup. This seems to be rate
> limited by the jujud process (up to 600% CPU) and a little bit Mongo
> (almost 100% CPU).
>
> It seems to take a while but with enough horsepower and GOMAXPROCS
> enabled it does seem to recover (IIRC it took about 20 minutes).
>
It might be worth exploring how we do upgrades so that the client sockets
stay open across the restart (a la nginx), i.e. serialize the extant watch
state and exec with the open fds, to avoid the extra thundering herd. An
upgrade already triggers one thundering herd as the agents restart
individually, and then the api server restart triggers another.
There's also an existing bug where restarting juju agents causes an
unconditional config-changed hook execution even when there is no delta on
the unit's config.
>
> 7) If I "juju deploy nrpe-external-master; juju add-relation ubuntu
> nrpe-external-master", very shortly thereafter "juju status" reports
> all agents (machine and unit agents) as "agent-state: down". Even the
> machine-0 agent. Given I was already close to capacity even on the
> unit machines, this could be any number of problems. I would like to
> try another test where we are a bit farther away from capacity.
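A quick (if crude) way to quantify that next time is to count the down agents
straight out of status:

    juju status --format yaml | grep -c 'agent-state: down'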
>
>
> 8) We do end up CPU throttled fairly often (especially if we don't set
> GOMAXPROCS). It is probably worth spending some time profiling what
> jujud is doing. I have the feeling all of those calls to CharmURL are
> triggering DB reads from Mongo, which is a bit inefficient.
>
> I would be fine doing max(1, NumCPUs()-1) or something similar. I'd
> rather do it inside jujud rather than in the cloud-init script,
> because computing NumCPUs is easier there. But we should have *a* way
> to scale up the central node that isn't just scaling out to more API
> servers.
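In the meantime, for experiments, the max(1, NumCPUs()-1) policy can be
approximated from whatever script launches jujud by exporting the environment
variable, e.g.:

    n=$(nproc)
    export GOMAXPROCS=$(( n > 1 ? n - 1 : 1 ))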
>
> 9) We also do seem to hit MongoDB limits. I ended up at 100% CPU for
> mongod, and it certainly never went above 100%. I didn't see any way to
> configure mongo to use more CPU. I wonder if it is limited to 1 CPU
> per connection, or if it is just always 1 CPU.
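One way to tell those two cases apart is to look at mongod's per-thread CPU
on the state server:

    top -H -b -n 1 -p $(pgrep -o mongod) | head -n 40

If one thread is busy and the rest are mostly idle, that points at lock
contention in mongod rather than a per-connection CPU limit.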
>
> I certainly think we need a way to scale Mongo as well. If it is just
> 1 CPU per connection then scaling horizontally with API servers should
> get us around that limit.
>
> 10) Allowing "juju add-unit -n 100 --to X" did make things a lot
> easier to bring up. Though it still takes a while for the request to
> finish. It felt like the api call kicked off work in the background,
> which made the call itself take longer to finally complete (as in,
> minutes once we had >1000 units).
>
> I generally went
> juju deploy ubuntu -n 10
> # grow to 100
> for i in `seq 10`; do juju add-unit -n 9 --to $i & done; time wait
> # grow to 1000
> for i in `seq 10`; do juju add-unit -n 90 --to ...
> # grow to 5000
> for i in `seq 10`; do juju add-unit -n 400 --to ...
>
> The branch with my patches is available at:
> lp:~jameinel/juju-core/scale-testing
>
> Not everything in there is worth landing in trunk (rudimentary API
> caching, etc).
>
> That's all I can think of for now, though I think there is more to be
> explored.
>
> John
> =:->