Some extensive system health monitoring

John Richard Moser nigelenki at comcast.net
Mon Mar 7 22:10:53 CST 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I was considering the smartmon thread about making an icon to
indicate health, and system recovery tools, and whatnot, and came up
with a bigger idea than just "Your hard drive will fail soon."
Potentially, "You need more CPU," "Memory is low," "Security regressions
detected," "User passwords cracked," etc.

Basically, I want a pluggable system to be created for the smartmontools
if that venue will be followed, rather than just an icon to warn about
imminent disk failure.  This system would then be expanded as time goes
on so that update-manager would be integrated into it (then deprecated
as a separate utility), and other tasks would be added.

Other tasks could include simple things such as a johnd/cracklib based
plug-in which notifies the user when it discovers passwords can be
cracked; or potentially a monitor to watch drive usage and warn when 95%
of a drive is used, especially when / has less than 256MiB free.
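
A minimal sketch of the drive usage check, assuming Python and the
standard shutil.disk_usage call.  The function names and the way the
thresholds are passed in are made up for illustration; only the 95% and
256MiB figures come from the paragraph above.

```python
import shutil

MIB = 1024 * 1024


def disk_warnings(total, free, is_root=False,
                  used_limit=0.95, root_min_free=256 * MIB):
    """Return warning strings for one filesystem; sizes are in bytes."""
    warnings = []
    # "Used" here means everything an ordinary user cannot write to any more.
    used_fraction = (total - free) / total if total else 0.0
    if used_fraction >= used_limit:
        warnings.append("%.0f%% of the drive is used" % (used_fraction * 100))
    if is_root and free < root_min_free:
        warnings.append("/ has less than %d MiB free" % (root_min_free // MIB))
    return warnings


def check_mount(path):
    """Look up real numbers for a mount point and apply the rules above."""
    usage = shutil.disk_usage(path)
    return disk_warnings(usage.total, usage.free, is_root=(path == "/"))


if __name__ == "__main__":
    for line in check_mount("/"):
        print("WARNING:", line)
```

Keeping the threshold logic separate from the disk lookup means the
rules can be exercised without touching a real filesystem.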

More advanced plug-ins would include monitors to watch memory usage and
make suggestions to optimize disk cache by tuning swappiness or adding
more RAM, and warn about the imminent threat of the OOM killer when swap
and RAM get too full.  A CPU analyzer to notice when X11 apps (which
should be interactive realtime tasks) are spending way too much time
cranking 100% CPU could suggest a faster CPU, or possibly more RAM if
the disk cache is very very small.
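
A rough sketch of the memory calculation, assuming the numbers are read
from /proc/meminfo.  It follows the rules given in the monitor list
further down (combine swap and RAM, ignore disk cache over 25% of
physical RAM, warn at 95%).  Treating "Cached" plus "Buffers" as the
disk cache is my reading of those rules, and the function names are
hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_usage_fraction(info, cache_cap=0.25):
    """Fraction of RAM + swap in use, counting cache only up to cache_cap."""
    ram_total = info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    cache = info.get("Cached", 0) + info.get("Buffers", 0)
    ram_used = ram_total - info["MemFree"]          # still includes cache
    # Cache beyond cache_cap of physical RAM is treated as reclaimable.
    counted_cache = min(cache, cache_cap * ram_total)
    ram_used_adjusted = ram_used - cache + counted_cache
    swap_used = swap_total - info.get("SwapFree", 0)
    return (ram_used_adjusted + swap_used) / (ram_total + swap_total)


def memory_warning(info, limit=0.95):
    """Return a warning string once the 95% limit is reached, else None."""
    fraction = memory_usage_fraction(info)
    if fraction >= limit:
        return "Memory is %.0f%% full; the OOM killer is close" % (fraction * 100)
    return None


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(memory_warning(parse_meminfo(f.read())) or "Memory is fine")
```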

Some other things could include security regression tests, such as
periodic paxtest and grsecurity regression tests for use if Ubuntu ever
adopts those systems.  Periodically testing during system run would be
important in case the internal kernel state was destroyed by a bug,
allowing some security system (SELinux, PaX, GrSecurity) to pass checks
it should fail or otherwise be bypassed.

A firewall utility (see below for ideas) wouldn't be bad either for
this, as a nice plug-in in the future.  Using proper GPG signatures and
firewall "modules," it should be possible to supply a flexible,
maintained firewall and allow the user to also configure NAT, DHCP, and
IP connection tracking.

In the end, a single, central utility for the monitoring and maintenance of
system health could be created to monitor hardware (CPU, disk failure,
disk usage, memory, even drivers losing track of hardware that WAS there
last boot); keep software up to date; handle security regression tests
on any future security enhancements and for existing systems like
passwords (John the Ripper); apply firewall rules; and handle system
configuration backup and restoration.
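
As a sketch of the configuration backup part, assuming Python's standard
tarfile module: one function makes a timestamped tarball, kept in a
separate directory per kind of backup so on-demand backups stay isolated
from automated ones, and another restores individually selected files.
The directory layout, file naming and function names are all invented
for illustration.

```python
import os
import tarfile
import time


def backup_config(source_dir, dest_dir, kind="auto"):
    """Write source_dir into dest_dir/<kind>/etc-<timestamp>.tar.gz."""
    target_dir = os.path.join(dest_dir, kind)   # "auto" vs. "manual" backups
    os.makedirs(target_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = os.path.join(target_dir, "etc-%s.tar.gz" % stamp)
    with tarfile.open(path, "w:gz") as tar:
        tar.add(source_dir, arcname="etc")
    return path


def restore_selected(tarball, names, dest_root):
    """Restore only the named regular files (e.g. "etc/passwd").

    Contents are copied by hand, so ownership and permissions are NOT
    restored here; a real tool would have to handle those as well.
    """
    wanted = set(names)
    restored = []
    with tarfile.open(tarball, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name in wanted:
                target = os.path.join(dest_root, member.name)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    dst.write(src.read())
                restored.append(member.name)
    return restored
```

Something like backup_config("/etc", "/var/backups/health") would be run
as root after each successful boot; that destination path is only an
example.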

Aside from simply flashing the icon on health issues and security
hazards, the interface should allow for categorical separation of the
context menu for the task tray program to allow for scalability given the
predicted volume of plug-ins.  Potentially:

Health      > Disk failure (S.M.A.R.T.)
              Disk usage
              Network availability
              Software updates
Performance > CPU usage statistics
              Memory usage statistics
              Disk cache size
              Network throughput
Backup      > System configuration backup
Security    > Security updates
              Password cracking
              PaX regressions
              GrSecurity regressions
              SELinux regressions
              Stack Smash Protection
              PIE and relocation statistics
              Firewall management

You get the idea.  The major concern is doing all this "in one place"
without clutter.  Most of this is just using existing tools (regression
tests, John the Ripper, smartmontools etc) and gathering data from /proc
(memory, CPU, disk usage, network throughput) with a pretty interface
(which must of course be coded during many laborious hours of hacking).



System Health Indicator Terminal
(get me a better name)

Monitors should be individual plug-ins; you'll notice some of them
are specific to certain systems, e.g. the PaX tests!
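
To make the plug-in idea concrete, here is a minimal sketch in Python.
The Monitor base class, the registry and every method name are
hypothetical; nothing like this exists yet.  The PaX monitor only
registers when a paxtest binary is on the PATH, which is just one
assumed way of deciding that a monitor applies to a system.

```python
import shutil


class Monitor:
    """Base class for a health monitor plug-in (hypothetical interface)."""
    category = "Health"
    name = "unnamed"

    def available(self):
        """Return False if this monitor makes no sense on this system."""
        return True

    def check(self):
        """Return a list of warning strings; an empty list means healthy."""
        return []


class PaxTestMonitor(Monitor):
    category = "Security"
    name = "PaX regressions"

    def available(self):
        # Assumption: only offer this monitor when paxtest is installed.
        return shutil.which("paxtest") is not None


class MonitorRegistry:
    def __init__(self):
        self._monitors = []

    def register(self, monitor):
        """Add a monitor if it applies here; report whether it was added."""
        if not monitor.available():
            return False
        self._monitors.append(monitor)
        return True

    def menu(self):
        """Group monitor names by category for the tray icon's context menu."""
        menu = {}
        for monitor in self._monitors:
            menu.setdefault(monitor.category, []).append(monitor.name)
        return menu

    def warnings(self):
        """Run every monitor and collect (category, name, text) tuples."""
        found = []
        for monitor in self._monitors:
            for text in monitor.check():
                found.append((monitor.category, monitor.name, text))
        return found
```

The tray icon would flash whenever warnings() returns anything, and the
menu() result maps directly onto the category layout shown earlier.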

Note that running tests "at boot" means running them once FOR THAT BOOT.  Don't
bother init with actually waiting; just make init start the background
daemon, which will go about its business running tests in idle time
etc etc etc.  Passive tests are also PASSIVE, in the background at
idle time when we really really know nothing else is going on!  We
want to monitor system health and enhance the user's experience, not
extend boot time and lag the system.
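
A sketch of how the background daemon might hold tests back until the
system is idle, assuming the one-minute load average from os.getloadavg
is a good enough idle signal.  The 0.2 load limit, the 30 second poll
interval and the function names are placeholders, not tuned values.

```python
import os
import time


def system_is_idle(load_limit=0.2, getloadavg=None):
    """Treat the system as idle when the 1-minute load average is low."""
    one_minute = (getloadavg or os.getloadavg)()[0]
    return one_minute < load_limit


def run_when_idle(tasks, is_idle=system_is_idle, poll_seconds=30,
                  sleep=time.sleep):
    """Run each task in order, but only start one while the system is idle."""
    results = []
    pending = list(tasks)
    while pending:
        if is_idle():
            task = pending.pop(0)
            results.append(task())
        else:
            # Something else is going on; back off and look again later.
            sleep(poll_seconds)
    return results
```

init would only start the daemon; the daemon then feeds its per-boot
tests through something like run_when_idle, so boot itself never waits
on them.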


Features:
 - Task tray icon for easy access
 - Warnings when system is unhealthy
 - Pluggable monitors for easy expansion and paced development

Monitors:
 + System Configuration Recovery
  - Check system configuration in /etc at each successful boot and
    make a backup tarball
  - Allow separate backup tarball of /etc on demand, isolated from
    automated backups
  - Provide for restoration of backup tarball during install and at
    any arbitrary point
   - Recovery should allow for four types of recovery:
    - Replace existing files in /etc with matching files from backup
    - Restore or augment entire /etc structure with backup
    - Restore individual /etc functions such as init.d scripts, rc.d
      runlevels, SQUID configuration, profile, authentication (passwd,
      shadow, group), apt, "Unknown configurations," etc
    - Restore individual selected files
 + S.M.A.R.T. monitoring using smartmontools
  - Warn when disks are faulty and will fail soon
 + Memory size
  - Warn when too much swap is used
   - Subtract off total size of files on tmpfs mounts in calculation
  - Warn when disk cache drops below X% (possibly 25%) of memory
   - Suggest more RAM
   - Suggest increasing swappiness if swap is relatively unused
  - Warn when too much total available memory is used
   - Combine swap and ram for total available memory
   - Ignore any disk cache over 25% of physical RAM in usage
     calculation
   - Warn when 95% limit reached
    - Warning message should briefly explain OOM killer:
      "If memory usage reaches 100%, tasks the OS thinks are unimportant
       will be terminated automatically.  No opportunity to save your
       work will be given!"
   - Suggest terminating tasks with large RSS
   - Suggest more memory
    - Note that more swap, swapfiles, and swapd may be used; but that
      these solutions may cause excessive system slowdown
  - Warn about particular tasks utilizing a great percentage more
    memory than when they started (difficult!  Tasks all have different
    needs!)
 + CPU audit
  - Audit lengths of high CPU usage bursts
  - Allow tracking of which programs use large amounts of CPU for
    extended periods, and how long
  - When many programs (e.g. not just SETI or gcc, but esp. anything
    linked to Xlibs, if through GTK+ or Qt or whatnot) cause
    particularly long CPU spikes, e.g. >10 s, suggest faster CPU
   - Again, this is a difficult task, as some programs should use lots
     of CPU
 + Security
  - Password testing
   - Have John hack passwords passively
   - Warn about weak passwords being cracked
   - Information about exact user account is privileged information!
     Only show to root!
  - PaX tests
   - PaX test once at boot as root and as normal user
   - PaX test once every 24 hours again to detect abnormal kernel states
   - Warn about abnormal failures
    - Possibly use ProPolice with paxtest to finish off tests PaX does
      NOT cover (don't warn about those tests, i.e. ret2library)
   - Passively scan system at idle time for libraries and executables
     with relocations
    - Allow review in the console
    - Allow specific active scanning of system or individual binaries
   - Passively scan system at idle time for ELF ET_EXEC executables
    - Allow review in console
    - Allow specific active scans
  - ProPolice tests
   - Test a ProPolice regression test suite once at boot
   - Warn about failure
   - Passively scan system at idle time for libraries and executables
     without reference to __guard and __stack_smash_handler
    - Allow review in console
    - Allow specific active scans
  - GrSecurity regression tests
   - Use a GrSecurity regression suite to do tests at each boot
   - Warn user of abnormal failures
   - Run tests that need root as root
   - Run tests that can be tested as user as user AND root
  - Security related updates
   - Warn when security-related software updates are available (how,
     I don't know)
   - Allow running of update manager
  - Firewall
   - Allow remote firewall rule "modules" to be fetched to construct a
     firewall of stock options (REQUIRE SIGNATURE)
   - Allow on-site configuration of IP masquerading, routing, port
     forwarding, and IP connection tracking
   - Notify when firewall rule modules are updated and ask the user if
     he wishes to update the firewall
 + Software management
  - Update notifier functionality becomes integrated
  - Easy access to synaptic, or simply integrate Synaptic
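
The passive scan for ELF ET_EXEC executables listed under the PaX tests
could start from something like the sketch below, assuming Python.  It
reads only the first 18 bytes of each file and decodes the e_type field
of the ELF header (ET_EXEC is 2, ET_DYN is 3).  The function names are
hypothetical, and walking the filesystem at idle time is left out.

```python
import struct

ET_NAMES = {1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def elf_type(header):
    """Return the ELF type name for a file header, or None if not ELF."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ET_NAMES.get(e_type, "unknown")


def find_et_exec(paths):
    """Return the paths that are fixed-address (ET_EXEC) ELF executables."""
    hits = []
    for path in paths:
        try:
            with open(path, "rb") as f:
                header = f.read(18)
        except OSError:
            continue        # unreadable or vanished files are skipped
        if elf_type(header) == "ET_EXEC":
            hits.append(path)
    return hits
```

Anything reported by find_et_exec is not position independent; ET_DYN
covers both shared libraries and PIE executables.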


- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCLSW+hDd4aOud5P8RAievAJ9HblGzVMisok/ygqI5/6KalcuwygCdEzCi
W2wtQwANn80nnHxLTQXkcqI=
=hcq4
-----END PGP SIGNATURE-----



More information about the ubuntu-devel mailing list