Performance statistics aggregation
Mark Seger
mjseger at gmail.com
Sat Jun 18 12:26:18 UTC 2011
> Some distributions have used SAR, which is part of sysstat. Other
> lightweight solutions exist, like collectl (L and not D) which lives at
> http://collectl.sourceforge.net. Those two only take care of collecting
> the data and do nothing about displaying it.
As the author of collectl, I have some thoughts. First and foremost, collectl
DOES do a lot about displaying data and provides a number of different formats.
If you include the collectl-utils package, also on sourceforge, it provides a
comprehensive web-based plotting tool called colplot. It also provides an
aggregator called colmux, which lets you aggregate and sort data from many
systems, both real-time and historical. I've run this on over 1000 nodes and
could easily see which nodes were using the most slab memory or had the busiest
disks. You can sort on literally anything collectl can collect.
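
To illustrate the kind of cross-node sorting described here, below is a minimal
sketch in Python. It is NOT how colmux is implemented; the node names, metric
names (slab_mb, disk_busy_pct) and values are all made up for the example.

```python
# Hypothetical latest sample per node; names and numbers are invented.
latest = {
    "node001": {"slab_mb": 512, "disk_busy_pct": 12},
    "node002": {"slab_mb": 2048, "disk_busy_pct": 97},
    "node003": {"slab_mb": 768, "disk_busy_pct": 45},
}


def top_nodes(samples, metric, count=2):
    """Return the `count` node names with the highest value for `metric`."""
    ranked = sorted(samples, key=lambda node: samples[node][metric], reverse=True)
    return ranked[:count]


# Which nodes use the most slab memory, and which has the busiest disk?
print(top_nodes(latest, "slab_mb"))
print(top_nodes(latest, "disk_busy_pct", count=1))
```

The same function works for any metric that every node reports, which is the
point: once the data is in one place, the sort key is just a column name.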
Another focus of collectl is the ability to supply/integrate data for other
tools. I know of one site running a 2300 node ganglia cluster. They get ALL
their data from collectl, which talks directly to gmetad over a UDP socket. It
sends a subset up to ganglia while keeping the deeper, detailed data locally,
since at 10 second sampling the full data would overwhelm ganglia.
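
The split between "forward a subset" and "keep everything locally" can be
sketched roughly as follows in Python. This is only an illustration of the
idea: the metric names, the FORWARDED set and the JSON payload are assumptions
made for the example, and JSON is a stand-in, not ganglia's actual wire format.

```python
import json
import socket

# Hypothetical choice of metrics coarse enough to send upstream.
FORWARDED = {"cpu_pct", "net_kb"}


def split_sample(sample):
    """Return (subset to forward upstream, full detail to keep locally)."""
    subset = {name: value for name, value in sample.items() if name in FORWARDED}
    return subset, dict(sample)


def forward(subset, host, port):
    """Send the subset as a single UDP datagram and return the bytes sent.

    The JSON encoding is a placeholder for whatever format the receiver expects.
    """
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


# One invented 10-second sample from a single node.
sample = {"cpu_pct": 37, "net_kb": 1200, "slab_mb": 512, "disk_busy_pct": 12}
subset, detail = split_sample(sample)
print(subset)   # what would go upstream
print(detail)   # what stays on the node
```

Calling forward(subset, host, port) with the address of a listener would send
the reduced record, while the full record stays available on the node for later
analysis.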
Let's also not forget the breadth of data collectl collects, including
InfiniBand; I think it is still one of the only tools that does that. And
all this at less than 0.1% of the CPU.
If this still isn't enough functionality, you can also write your own data
collection modules; for example, one I just released with the latest version
can monitor NVIDIA GPUs.
There were also previous comments in this thread about ganglia, and the question
was never raised about plotting data via RRD, which is what ganglia does
natively. I'm the first to agree these plots look very good, but at the same
time they do too much normalization for me to find them useful. If ganglia/rrd
tells me my network is cruising along at 30% I might be feeling pretty good, but
if I plot the actual data with colplot I might see multi-second spikes of 100%,
not a good thing. Just be warned...
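
To make the averaging point concrete, here is a minimal sketch in Python using
made-up numbers (not real collectl or ganglia output). It assumes the plotting
side averages many fine-grained samples into one point, which is the effect
being warned about; the consolidate() helper is hypothetical.

```python
# Invented data: network utilization (%) sampled every 10 seconds for 5 minutes.
samples = [20.0] * 30
# Three consecutive samples (30 seconds) where the link is saturated.
samples[10:13] = [100.0, 100.0, 100.0]


def consolidate(values, bucket):
    """Average each consecutive group of `bucket` samples into one point."""
    averaged = []
    for start in range(0, len(values), bucket):
        group = values[start:start + bucket]
        averaged.append(sum(group) / len(group))
    return averaged


# Collapse the whole 5 minutes into a single plotted point.
averaged = consolidate(samples, 30)
print(max(samples))  # peak in the raw 10-second data
print(averaged)      # what the averaged plot would show
```

With these numbers the raw data peaks at 100% while the single averaged point
is 28%, so the 30 seconds of saturation disappear from the plot entirely.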
-mark