Performance statistics aggregation

Sat Jun 18 17:12:56 UTC 2011

Excerpts from Mark Seger's message of Sat Jun 18 05:26:18 -0700 2011:
> 
> > Some distributions have used SAR, which is part of sysstat. Other
> > lightweight solutions exist, like collectl (L and not D) which lives at
> > http://collectl.sourceforge.net. Those two only take care of collecting
> > the data and do nothing about displaying it.
> 
> As the author of collectl, I have some thoughts.  First and foremost collectl 
> DOES do a lot about displaying data and provides a number of different formats.  
> If you include the collectl-utils package, also on sourceforge, it provides a 
> comprehensive web-based plotting tool called colplot.  It also provides an 
> aggregator called colmux which allows you to aggregate/sort data from many 
> systems both realtime and historical.  I've run this on over 1000 nodes and 
> easily could see which nodes were using the most slab memory or had the busiest 
> disks.  You can sort on literally anything collectl can collect.
> 
> Another focus of collectl is the ability to supply/integrate data for other 
> tools.  I know of one site running a 2300 node ganglia cluster. They get ALL 
> their data from collectl which talks directly to gmetad over a UDP socket, which 
> sends a subset up to ganglia while keeping the deeper detailed data locally, since 
> at 10 second sampling it would overwhelm ganglia.
> 

Mark, wow, that's pretty awesome... now I'm quite interested in collectl
as I made a brief attempt to create something like this about a year ago.

I'm curious about the I/O impact that collectl has. One thing that
tends to crush RRD based systems is the amount of random I/O needed to
record the data.  The caching daemon (rrdcached) added in recent RRDtool
versions helps by aggregating syncs and writes so they're more linear.
What does collectl do, and how durable is the data it collects?

To contrast what you've said collectl does, sysstat just takes a snapshot
every 10 minutes, and isn't very painful to write out because it's just
a few hundred integers and floats at the most.

One interesting trend I've seen also is to have individual nodes write
to a local log file without syncing, and let a lazy writer send those
to a centralized machine for safer storage and/or aggregation.
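
To make that pattern concrete, here is a rough Python sketch of what I
mean. It is not how collectl or sysstat work; the function names, the
space-separated line format, and the offset file are all made up for
illustration, and send() is a placeholder for whatever transport you
would use to reach the central machine.

```python
import os
import tempfile


def append_sample(log_path, fields):
    """Append one sample as a text line. There is no fsync here, so the
    kernel is free to flush the page cache whenever it likes."""
    with open(log_path, "a") as log:
        log.write(" ".join(str(f) for f in fields) + "\n")


def ship_new_lines(log_path, offset_path, send):
    """Lazy writer: pass lines added since the last run to send(), then
    remember how far we got. Returns the number of lines shipped."""
    try:
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    except FileNotFoundError:
        offset = 0  # first run: start from the beginning of the log

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()

    # Ship only complete lines; a partly written last line waits for the
    # next pass.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode().splitlines()
    if lines:
        send(lines)  # hypothetical transport: a socket, scp, a queue, ...
        with open(offset_path, "w") as f:
            f.write(str(offset + end))
    return len(lines)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    log_file = os.path.join(workdir, "stats.log")
    offset_file = os.path.join(workdir, "stats.offset")
    append_sample(log_file, [1308417176, 0.42, 1024])
    ship_new_lines(log_file, offset_file, print)
```

The trade-off is the obvious one: a crash can lose whatever was still
in the page cache and not yet shipped, and if send() succeeds but the
offset write does not, the same lines get sent again on the next pass.
In exchange the node itself only ever does cheap sequential appends.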

Anyway, I do think it would be cool to have something like this enabled
by default, but only if it truly is less than 1% of total system resources
(not just CPU).