Notes from Scale testing

Wed Oct 30 15:02:36 UTC 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> 
> 4) If I bring up the units one by one (for i in `seq 500`; do for j
> in `seq 10`; do juju add-unit --to $j & done; time wait; done), it
> ends up triggering O(N^2) behavior in the system. Each unit agent
> seems to have a watcher for other units of the same service. So
> when you add 1 unit, it wakes up all existing units to let them
> know about it.
> 
> 
> I tried to talk about this in the hangout this morning, but I'm not
> sure if I got my point across.  I don't know that this really
> qualifies as N^2 given that no single machine sends or receives
> more than N messages. The network takes an N^2 hit. It's really
> only O(N) per unit agent.  It might be N^2 for the state server if
> each agent pings the state server when it receives the unit-add
> message... but it seems unlikely that we'd do that (and if we do,
> we should fix that).
> 

Adding 1000 units to a service triggers 1000*1000 (*4) lines written
to the log file, and that is 1M requests against the API server. That
seems N^2 to me.

All units watch for changes in their Service. When you add a unit, it
triggers the existing units to ask the API server what the CharmURL
and Service.Life are.

So yes, it is N^2. I have the 2GB log file if you want to investigate. :)

I did paste a couple snippets, where you can see adding 1 unit caused
several hundred requests for CharmURL to come back.
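
To make the arithmetic concrete, here is a rough sketch in Go. It
assumes each added unit wakes every unit that already exists, and that
each woken unit makes 2 API requests (one for CharmURL, one for
Service.Life). Both numbers are assumptions for illustration, not
values read out of the log, and totalRequests is a made-up helper, not
anything in juju.

```go
package main

import "fmt"

// totalRequests estimates how many API requests result from adding n units
// one at a time. Assumptions (not measured): every add wakes all units that
// already exist, and each woken unit issues perWake requests.
func totalRequests(n, perWake int) int {
	total := 0
	for existing := 0; existing < n; existing++ {
		// A new unit arrives while `existing` units are already watching.
		total += existing * perWake
	}
	return total
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("units=%d requests=%d\n", n, totalRequests(n, 2))
	}
}
```

Under those assumptions the total is perWake * N*(N-1)/2, so it grows
with the square of the number of units even though each individual
agent only handles O(N) wake-ups.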

> 
> 
> 8) We do end up CPU throttled fairly often (especially if we don't
> set GOMAXPROCS). It is probably worth spending some time profiling
> what jujud is doing. I have the feeling all of those calls to
> CharmURL are triggering DB reads from Mongo, which is a bit
> inefficient.
> 
> I would be fine doing max(1, NumCPUs()-1) or something similar.
> I'd rather do it inside jujud than in the cloud-init script,
> because computing NumCPUs is easier there. But we should have *a*
> way to scale up the central node that isn't just scaling out to
> more API servers.
> 
> 
> It seems as though GOMAXPROCS = NumCPUs is probably better, just
> letting the OS handle scheduling.
> 

I'm happy with that on the API servers, though I would consider always
leaving some CPU free for Mongo. On non-JobHostsState machines I would
consider throttling more, given that they are serving user workloads.
(They also shouldn't ever be trying to do as much work as the state
server nodes, though.)
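
As a minimal sketch of the max(1, NumCPUs()-1) idea, something like
the following could run early in the process. procsFor and its
reserved parameter are hypothetical names for illustration, not
existing jujud code. runtime.NumCPU and runtime.GOMAXPROCS are the
standard library calls.

```go
package main

import (
	"fmt"
	"runtime"
)

// procsFor returns a GOMAXPROCS value that leaves `reserved` cores free
// (for example for Mongo) but never drops below 1.
func procsFor(numCPU, reserved int) int {
	if n := numCPU - reserved; n > 1 {
		return n
	}
	return 1
}

func main() {
	n := procsFor(runtime.NumCPU(), 1)
	// GOMAXPROCS applies the new limit and returns the previous setting.
	prev := runtime.GOMAXPROCS(n)
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, n)
}
```

A larger reserved value would be one way to throttle harder on
machines that are mostly serving user workloads.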
...

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (Cygwin)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlJxH4wACgkQJdeBCYSNAAPKpgCgtEXgZhxZFflodfeCXbhc9lU1
orsAn0jrbN/dyBUs2VPskYjR+0qknmDl
=EdO3
-----END PGP SIGNATURE-----