Performance statistics aggregation

Mark Seger mjseger at gmail.com
Sat Jun 18 12:26:18 UTC 2011


> Some distributions have used SAR, which is part of sysstat. Other
> lightweight solutions exist, like collectl (L and not D) which lives at
> http://collectl.sourceforge.net. Those two only take care of collecting
> the data and do nothing about displaying it.

As the author of collectl, I have some thoughts.  First and foremost, collectl 
DOES do a lot about displaying data and provides a number of different formats.  
If you include the collectl-utils package, also on sourceforge, it provides a 
comprehensive web-based plotting tool called colplot.  It also provides an 
aggregator called colmux, which lets you aggregate and sort data from many 
systems, both real-time and historical.  I've run this on over 1000 nodes and 
could easily see which nodes were using the most slab memory or had the busiest 
disks.  You can sort on literally anything collectl can collect.
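
To make the "which nodes are busiest" idea concrete, here is a minimal sketch 
in Python.  It is not colmux and does not use any collectl interface; the node 
names, metric names (slab_mb, disk_util_pct) and values are all invented for 
illustration.  It only shows the general shape of ranking many systems by one 
chosen column.

```python
# Illustration only: hypothetical per-node samples, not real collectl/colmux output.
def top_nodes(samples, metric, n=3):
    """Return the n (node, value) pairs with the highest value for `metric`."""
    ranked = sorted(
        ((node, values[metric]) for node, values in samples.items() if metric in values),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]

# Made-up data: one latest sample per node, keyed by hypothetical metric names.
samples = {
    "node001": {"slab_mb": 512, "disk_util_pct": 12},
    "node002": {"slab_mb": 2048, "disk_util_pct": 97},
    "node003": {"slab_mb": 256, "disk_util_pct": 45},
    "node004": {"slab_mb": 1024, "disk_util_pct": 80},
}

# Rank by whichever column is of interest.
print(top_nodes(samples, "slab_mb", n=2))
print(top_nodes(samples, "disk_util_pct", n=2))
```

Swapping the metric name changes the sort column without touching the rest, 
which is the property that matters when there are a thousand nodes rather 
than four.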

Another focus of collectl is the ability to supply/integrate data for other 
tools.  I know of one site running a 2300 node ganglia cluster.  They get ALL 
their data from collectl, which talks directly to gmetad over a UDP socket.  It 
sends a subset up to ganglia while keeping the more detailed data locally, 
since at 10 second sampling the full data would overwhelm ganglia.

Let's also not forget the breadth of data collectl collects, including 
InfiniBand; I think it is still one of the only tools that does that.  And 
all this at less than 0.1% of the CPU.

If this still isn't enough functionality, you can also write your own data 
collection modules.  For example, I just released one with the latest version 
that can monitor nvidia GPUs.

There were also previous comments in this thread about ganglia, but nobody 
raised the question of plotting data via RRD, which is what ganglia does 
natively.  I'm the first to agree these plots look very good, but at the same 
time they do too much normalization for me to find them useful.  If ganglia/rrd 
tells me my network is cruising along at 30% I might be feeling pretty good, but 
if I plot the actual data with colplot I might see multi-second spikes of 100%, 
not a good thing.  Just be warned...
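
A small sketch of why an averaged view can hide those spikes.  This is not RRD 
or collectl code, and the numbers are made up: it assumes 10 second samples and 
a single 5 minute averaging bucket, chosen so the arithmetic lands on 30%.

```python
# Illustration only: invented utilization samples, not real collectl or RRD data.
def consolidate(samples, bucket):
    """Average consecutive groups of `bucket` samples, as an averaging roll-up would."""
    out = []
    for start in range(0, len(samples), bucket):
        group = samples[start:start + bucket]
        out.append(sum(group) / len(group))
    return out

# 30 samples at an assumed 10-second interval: 5 minutes of network utilization in %.
samples = [16] * 30
for i in (4, 5, 17, 18, 19):   # a 20-second burst and a 30-second burst at saturation
    samples[i] = 100

averaged = consolidate(samples, bucket=30)   # one 5-minute average
peak = max(samples)

print("5-minute average:", averaged[0])      # 30.0
print("peak 10-second sample:", peak)        # 100
```

The averaged series reports a comfortable 30% while the raw samples show the 
link saturated for 50 of the 300 seconds.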

-mark





