Scale Testing: Now with profiling!

John Arbash Meinel john at arbash-meinel.com
Fri Nov 1 12:07:41 UTC 2013


On 2013-10-31 11:11, John Arbash Meinel wrote:
> So I managed to instrument a jujud with both CPU and Mem profiling 
> dumps. I then brought up 1000 units and did some poking around.
> 
> The results were actually pretty enlightening.
> 
> 
> 1) Guess what the #1 CPU time was. I know I was surprised:
>
> Total: 25469 samples
>   14380  56.5%  56.5%    14404  56.6% crypto/sha512.block
>    1261   5.0%  61.4%     1261   5.0% crypto/hmac.(*hmac).tmpPad
>    1219   4.8%  66.2%    15737  61.8% crypto/sha512.(*digest).Write
>    1208   4.7%  70.9%     9548  37.5% crypto/sha512.(*digest).Sum
>     439   1.7%  72.7%    19046  74.8% launchpad.net/juju-core/thirdparty/pbkdf2
> 

One observation I didn't report: when testing this out, machine-0's
agent would often work for a while, but eventually it would end up
hitting 100% CPU and not getting any other work done. I didn't notice
it in top at first, but it was actually spending all of that time in
sys.

So I did some googling and found this:

http://grokbase.com/t/gg/golang-dev/1388yzq7yb/code-review-12183044-syscall-disable-cpu-profiling-around-fork-issue-12183044

Given that I saw some of the hangups around the time we were trying to
run "lxc-ls", it is possible that enabling CPU profiling causes fork
to hang.
https://groups.google.com/forum/#!topic/golang-bugs/9Gyeef14Zaw
http://code.google.com/p/go/issues/detail?id=5517
https://code.google.com/p/gperftools/issues/detail?id=278

So some of the hanging that I saw may not actually be a problem in
practice (when CPU profiling is off), but not being able to profile
the process seems pretty unfortunate.

It looks like the fix landed on August 13:
  http://code.google.com/p/go/source/detail?r=9eb1dd061b1f

That is the day after Go 1.1.2 was released.


At least that alleviates my fear that when jujud restarts we have a
high probability of hanging permanently. And I got enough profiling to
see that pbkdf2 appears to be the primary cause of slow startup. At
70ms per login, 10,000 agents need 700s (~12min) of CPU time, or about
3min with 4 CPUs.

I'm still skeptical that we need pbkdf2 for Agent logins, though I do
like it for user logins. (We are generating 18 character passwords
because originally they were used by Mongo which "only" md5sum'd them.
We could use sha512 and 64-byte/128-hex tokens if we cared.)
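
A minimal sketch of what the sha512 alternative could look like, using
only the Go standard library. This is not juju's actual code: the
function names (newAgentToken, hashToken, checkToken) are hypothetical,
and it assumes the server stores only the hash of each agent's token.
The idea is that a 64-byte random token already has far more entropy
than a password, so a single hash may be enough and the per-login cost
of key stretching goes away.

```go
package main

import (
	"crypto/rand"
	"crypto/sha512"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newAgentToken returns 64 random bytes encoded as 128 hex characters.
func newAgentToken() (string, error) {
	buf := make([]byte, 64)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// hashToken returns the hex-encoded SHA-512 digest of the token.
// Only this value would be stored on the server side.
func hashToken(token string) string {
	sum := sha512.Sum512([]byte(token))
	return hex.EncodeToString(sum[:])
}

// checkToken hashes the presented token once and compares it to the
// stored hash in constant time.
func checkToken(token, storedHash string) bool {
	presented := hashToken(token)
	return subtle.ConstantTimeCompare([]byte(presented), []byte(storedHash)) == 1
}

func main() {
	token, err := newAgentToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(token)

	fmt.Println("token length:", len(token))
	fmt.Println("correct token accepted:", checkToken(token, stored))
	fmt.Println("wrong token accepted:", checkToken("not-the-token", stored))
}
```

User logins would still go through pbkdf2 in this scheme, since
human-chosen passwords are the case where stretching actually helps.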

John
=:->


