Some extensive system health monitoring

Thibaut Varene faucon.millenium at
Tue Mar 8 07:59:30 CST 2005

On Mon, 07 Mar 2005 23:10:53 -0500, John Richard Moser
<nigelenki at> wrote:

> More advanced plug-ins would include monitors to watch memory usage and
> make suggestions to optimize disk cache by tuning swappiness or adding
> more RAM, and warn about the imminent threat of the OOM killer when swap
> and RAM get too full.  A CPU analyzer to notice when X11 apps (which
> should be interactive realtime tasks) are spending way too much time

Hell no. This is not what the RT priority is meant for.

> cranking 100% CPU could suggest a faster CPU, or possibly more RAM if
> the disk cache is very very small.

What about suggesting that the user get some coffee when building a big
source tree? :)

> You get the idea.  The major concern is doing all this "in one place"
> without clutter.  Most of this is just using existing tools (regression
> tests, John the Ripper, smartmontools etc) and gathering data from /proc
> (memory, CPU, disk usage, network throughput) with a pretty interface
> (which must of course be coded during many laborious hours of hacking).
> System Health Indicator Terminal
> (get me a better name)
> Monitors should be individual plug-ins; you'll notice some of them
> are specific to certain systems, e.g. PaX tests!
> Note that tests run "at boot" mean run the tests FOR THE BOOT.  Don't
> bother init with actually waiting; just make init start the background
> daemon, which will go about its business running tests in idle time
> etc etc etc.  Passive tests are also PASSIVE, in the background at
> idle time when we really really know nothing else is going on!  We
> want to monitor system health and enhance the user's experience, not
> extend boot time and lag the system.

A few generic remarks about what you suggest:

Though providing the user with *useful* info is a good thing,
providing him with *loads* of unexpected, unwanted, incomprehensible
info is a *very bad thing*.

Keep in mind we're not aiming at the average geek, we're aiming at the
average newbie. The geek-type can install gkrellm and get pretty much
everything you're talking about (minus the security regression tests, I
suppose). The admin-type would install snmpd and get the same through
the MIB database. The average user doesn't care about PaX, network
throughput etc. In fact, not only doesn't he care, most of the time he
can't grasp the meaning of these words... (and believe me, switching a
few computer beginners to Linux gave me quite a good insight into what
they care about ;)
> Features:
>  - Task tray icon for easy access
gkrellm and the like
>  - Warnings when system is unhealthy
gkrellm, snmpd and the like
>  - Pluggable monitors for easy expansion and paced development
gkrellm, snmpd and the like

> Monitors:
>  + System Configuration Recovery
>   - Check system configuration in /etc at each successful boot and
>     make a backup tarball

Pretty pointless. Where would you back up that tarball? If the
filesystem gets corrupted, it's likely you can't trust any file stored
on that filesystem. Now remember that we suggest a single partition

>   - Allow separate backup tarball of /etc on demand, isolated from
>     automated backups
>   - Provide for restoration of backup tarball during install and at
>     any arbitrary point
>    - Recovery should allow for four types of recovery:
>     - Replace existing files in /etc with matching files from backup
>     - Restore or augment entire /etc structure with backup
>     - Restore individual /etc functions such as init.d scripts, rc.d
>       runlevels, SQUID configuration, profile, authentication (passwd,
>       shadow, group), apt, "Unknown configurations," etc
>     - Restore individual selected files

This would only be useful either if the disk got corrupted and the
backup had been made on a safe medium, or if the user is editing the
config files himself. In the first case, it's not a background
task, and could be implemented as a userland utility, and in the
latter case, the user is smart enough to back up his files before
If you fear filesystem corruption, see the remark above.

This looks to me much more like "automate my routine admin tasks"
than "provide useful information to the user"...

>  + S.M.A.R.T. monitoring using smartmontools
>   - Warn when disks are faulty and will fail soon

Certainly a good thing. An icon or a message box is enough, I think.
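As a rough sketch of how little is needed for such a warning, the
snippet below asks smartctl (from smartmontools) for the overall health
verdict and turns it into one short sentence. It assumes an ATA disk
whose `smartctl -H` output contains the "overall-health self-assessment
test result" line; the function names, the device path and the message
wording are hypothetical, and running smartctl normally needs root.

```python
import subprocess


def smart_health_from_output(text):
    """Return True (healthy), False (failing) or None (no verdict found).

    Only the ATA-style verdict line is recognised here; other output
    formats are an unhandled case in this sketch.
    """
    for line in text.splitlines():
        if "overall-health self-assessment test result" in line:
            verdict = line.rsplit(":", 1)[1].strip()
            return verdict == "PASSED"
    return None


def check_disk(device):
    """Run `smartctl -H` on device; None if smartctl cannot be run."""
    try:
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return None
    return smart_health_from_output(result.stdout)


def disk_message(device, healthy):
    """One short sentence for a message box, or None if nothing to say."""
    if healthy is False:
        return ("The disk %s reports that it may fail soon. "
                "Please back up your files." % device)
    return None
```

A tray icon would only need `disk_message("/dev/hda", check_disk("/dev/hda"))`
and would stay silent whenever the result is None.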

>  + Memory size
>   - Warn when too much swap is used

Back in the days of Mac OS 9 and earlier, when you were using too much
memory, you'd get a message like "your system is running low on
memory, please consider quitting some running applications". That is
imho the most you should tell the user. An experienced user
won't need more since he can get the info himself, and a
newbie has just enough data to know: 1) what happens and 2) what to

>    - Subtract off total size of files on tmpfs mounts in calculation
>   - Warn when disk cache drops below X% (possibly 25%) of memory
>    - Suggest more RAM

Telling the user he has sucky hardware doesn't look that good to me... :)

>    - Suggest increasing swappiness if swap is relatively unused

What do you mean? Swap is handled by the kernel VM, you don't want an
average user to dive in and tweak the kernel behaviour, do you? :)

>   - Warn when too much total available memory is used
>    - Combine swap and ram for total available memory
>    - Ignore any disk cache over 25% of physical RAM in usage
>      calculation

This is bloatsome. No need for such a complex mechanism imho.
Look at how the GNOME panel "system monitor" applet computes used RAM:
it doesn't count cache (more precisely, it separates it from the output).
This is enough to know how much free memory is available. If the
kernel needs to recover cached memory, it can do it.
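
To illustrate the "don't count cache" approach, here is a minimal
sketch that works on /proc/meminfo-style text. It is not the applet's
actual code; the function names and the sample figures are made up, and
it simply assumes used = MemTotal - MemFree - Buffers - Cached.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_excluding_cache(info):
    """Memory in use by programs, leaving out buffers and page cache."""
    return (info["MemTotal"] - info["MemFree"]
            - info.get("Buffers", 0) - info.get("Cached", 0))


# Illustrative numbers only, not taken from a real machine.
SAMPLE = """MemTotal:  1000000 kB
MemFree:    100000 kB
Buffers:     50000 kB
Cached:     350000 kB
"""

info = parse_meminfo(SAMPLE)
print(used_excluding_cache(info), "kB used, cache not counted")
```

On a real system the text would come from reading /proc/meminfo instead
of SAMPLE.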

>    - Warn when 95% limit reached
>     - Warning message should briefly explain OOM killer:
>       "If memory usage reaches 100%, tasks the OS thinks are unimportant
>        will be terminated automatically.  No opportunity to save your
>        work will be given!"

This kind of message should go along with the one I suggested, I think.
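
A minimal sketch of combining the two texts into a single warning,
assuming the 95% limit from the quoted proposal. The function name and
parameters are hypothetical, and how "used" and "total" are measured is
left to the caller.

```python
LOW_MEMORY_TEXT = (
    "Your system is running low on memory, please consider quitting "
    "some running applications."
)
OOM_TEXT = (
    "If memory usage reaches 100%, tasks the OS thinks are unimportant "
    "will be terminated automatically. No opportunity to save your "
    "work will be given!"
)


def memory_warning(used_kb, total_kb, limit_percent=95):
    """Return one combined warning once usage reaches the limit, else None."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if used_kb * 100 < total_kb * limit_percent:
        return None
    return LOW_MEMORY_TEXT + "\n\n" + OOM_TEXT
```

The user sees one message, once, rather than two separate popups.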

>    - Suggest terminating tasks with large RSS

Hints as to which tasks consume the most memory might be interesting.
You would want to filter out all that aren't running with the user's UID
to avoid suggesting killing "X" for instance :) That could be useful
if, say, the first three entries are suggested in the kind of message I
>    - Suggest more memory
>     - Note that more swap, swapfiles, and swapd may be used; but that
>       these solutions may cause excessive system slowdown

Definitely. Not good.

>   - Warn about particular tasks utilizing a great percentage more
>     memory than when they started (difficult!  Tasks all have different
>     needs!)
>  + CPU audit
>   - Audit lengths of high CPU usage bursts
>   - Allow tracking of which programs use large amounts of CPU for
>     extended periods, and how long
>   - When many programs (i.e. not just SETI or gcc, but esp. anything
>     linked to Xlibs, if through GTK+ or Qt or whatnot) cause
>     particularly long CPU spikes, i.e. >10S, suggest faster CPU
>    - Again, this is a difficult task, as some programs should use lots
>      of CPU

This is again bloatsome. You will use too many system resources for all
that computation, and provide the user with complex data he won't be
able to parse.

>  + Security
>   - Password testing
>    - Have John hack passwords passively
>    - Warn about weak passwords being cracked
>    - Information about exact user account is privileged information!
>      Only show to root!

We don't have root.
Average users like easy passwords; let them have them. A notice about
good password composition when changing the password is enough. Remember
we have a "no open port" default policy.
If my GF gets prompted for a new password on an hourly basis because
she chose my nickname as a password, she'll trash the system (or she'll
trash me), I think ;o)

>   - PaX tests
>    - PaX test once at boot as root and as normal user
>    - PaX test once every 24 hours again to detect abnormal kernel states
>    - Warn about abnormal failures
>     - Possibly use ProPolice with paxtest to finish off tests PaX does
>       NOT cover (don't warn about those tests, i.e. ret2library)

This is only meaningful in some environments. Again, this is aimed
at experienced users who can install it and do the right stuff
themselves. The average user doesn't care/doesn't know about PaX.
The average user is not running a shell-account server, I think :)

>    - Passively scan system at idle time for libraries and executables
>      with relocations
>     - Allow review in the console

console... Are we still talking about Mr. Foo? :)

>     - Allow specific active scanning of system or individual binaries
>    - Passively scan system at idle time for ELF ET_EXEC executables
>     - Allow review in console
>     - Allow specific active scans
>   - ProPolice tests
>    - Test a ProPolice regression test suite once at boot
>    - Warn about failure
>    - Passively scan system at idle time for libraries and executables
>      without reference to __guard and __stack_smash_handler
>     - Allow review in console
>     - Allow specific active scans
>   - GrSecurity regression tests
>    - Use a GrSecurity regression suite to do tests at each boot
>    - Warn user of abnormal failures
>    - Run tests that need root as root
>    - Run tests that can be tested as user as user AND root

This is completely irrelevant to the average setup/user imho.

>   - Security related updates
>    - Warn when security related software updates are available (how I
>      don't know)
>    - Allow running of update manager

We already have an update manager on hoary and it works just fine.

>   - Firewall
>    - Allow remote firewall rule "modules" to be fetched to construct a
>      firewall of stock options (REQUIRE SIGNATURE)
>    - Allow on-site configuration of IP masquerading, routing, port
>      forwarding, and IP connection tracking
>    - Notify when firewall rule modules are updated and ask the user if
>      he wishes to update the firewall

Not needed. No open port by default. The user starting to install
server daemons and opening ports should know what he's doing. At most,
I can imagine he'd be warned about the security implications of what
he's doing (as Mandrake does when you ask to install Apache and the

>  + Software management
>   - Update notifier functionality becomes integrated
>   - Easy access to synaptic, or simply integrate Synaptic

Already done, as far as I can tell.

This all looks largely irrelevant and overkill to me, for something
that started out as "making an icon to indicate health", but that's
just my opinion...

Hope that helps,


Thibaut VARENE
Ubuntu, Debian and Kernel Hacker

The difference between the right word and the almost right word is the
difference between lightning and the lightning bug.
                -- Mark Twain
