Some extensive system health monitoring
John Richard Moser
nigelenki at comcast.net
Tue Mar 8 12:05:24 CST 2005
Thibaut Varene wrote:
> On Mon, 07 Mar 2005 23:10:53 -0500, John Richard Moser
> <nigelenki at comcast.net> wrote:
>
> [snip]
>
>
>>More advanced plug-ins would include monitors to watch memory usage and
>>make suggestions to optimize disk cache by tuning swappiness or adding
>>more RAM, and warn about the imminent threat of the OOM killer when swap
>>and RAM get too full. A CPU analyzer to notice when X11 apps (which
>>should be interactive realtime tasks) are spending way too much time
>
>
> Hell no. This is not what the RT priority is meant for.
REAL TIME, i.e. not batch
A batch task takes X amount of data and crunches it down until it's done
with it; a real-time task has to interact with the user and thus spends
more time sleeping and waiting for keyboard/mouse interrupts than
actually using CPU.
XMMS is a real time task. It pretty much chews on a piece of an Ogg
Vorbis, spits out wave data, puts it on the sound card. It could do
that in 1 second, but it doesn't; it's paced to operate in real time,
not as a batch task faster than realtime.
>
>
>>cranking 100% CPU could suggest a faster CPU, or possibly more RAM if
>>the disk cache is very very small.
>
>
> What about suggesting the user to get some coffee when building a big
> source tree? :)
>
Fuzzy logic :)
>
>>You get the idea. The major concern is doing all this "in one place"
>>without clutter. Most of this is just using existing tools (regression
>>tests, John the Ripper, smartmontools etc) and gathering data from /proc
>>(memory, CPU, disk usage, network throughput) with a pretty interface
>>(which must of course be coded during many laborious hours of hacking).
>>
>>System Health Indicator Terminal
>>(get me a better name)
>>
>>Monitors should be individual plug-ins; you'll notice some of them
>>are specific to certain systems, e.g. PaX tests!
>>
>>Note that tests run "at boot" mean run the tests FOR THE BOOT. Don't
>>bother init with actually waiting; just make init start the background
>>daemon, which will go about its business running tests in idle time
>>etc etc etc. Passive tests are also PASSIVE, in the background at
>>idle time when we really really know nothing else is going on! We
>>want to monitor system health and enhance the user's experience, not
>>extend boot time and lag the system.
>
>
> A few generic remarks about what you suggest:
>
> though providing the user with *useful* info is a good thing,
> providing him with *loads* of unexpected, unwanted, incomprehensible
> info is a *very bad thing*.
>
Keep it simple.
For example, imagine Ubuntu supported GrSecurity.
(!)
*hover* "Security warning"
*click*
One or more of the PaX tests failed. This means that your computer
may be vulnerable to security attacks. Please upgrade your kernel if
a new one is out.
[Check for updates] [Advanced] [Close]
Would you like to send information about this issue back to Ubuntu?
The information is about the problem only, and does not contain any
information about you except for info about your computer's hardware
in some cases.
[Send report]
Although the details of the failing PaX test (say, anonymous mappings
being created writable and executable by mistake) are way over the
user's head, the above message is quite simple and straightforward.
"Something you don't understand broke. You're in danger. Upgrade this
thing. Just click here, we know what to do, don't try to think about
it." Your average geek will obviously understand everything here, but
it's enough for your average user to just say "Oh, security, updates,
do that, yes."
This could be done for smartmon too.
(!)
*hover* "Disk health EMERGENCY"
*click*
The hda disk, which is your /primary master IDE/ disk, reports that
it may fail soon, somewhere around 2/15/2006, or in 3 days. Please
replace or have a technician replace this disk ASAP. It is
recommended that you back up your data or shut down your computer
until then.
The disk "hda" contains / and swap. No unknown partitions have been
detected on this disk. Failure to replace this disk soon may force
you to reinstall Ubuntu Linux.
If /home is on /:
The / partition also contains /home, which contains all of your
settings and usually contains your personal files. Failure of the
disk "hda" will result in loss of these files.
If /home is on hda:
The disk "hda" contains /home, which contains all of your settings
and usually contains your personal files. Failure of the disk "hda"
will result in loss of these files.
[Close] [Advanced]
All of these messages are very straightforward: "Your hard disk is
going to fail; it contains these partitions, which normally hold this
data; the repercussions will be as follows if you do not replace it NOW."
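Picking which of the two /home paragraphs to show could be as simple as
looking at what is mounted from the failing disk. A rough Python sketch,
assuming the monitor is handed the text of /proc/mounts and the short
disk name. The function names and the shortened message strings are made
up for illustration, the prefix match on the device name is naive, and
swap would have to be looked up in /proc/swaps instead:

```python
def parse_mounts(mounts_text):
    """Return (device, mount_point) pairs from /proc/mounts-style text."""
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def home_notice(mounts_text, disk):
    """Pick the /home paragraph for a failing disk such as "hda"."""
    pairs = parse_mounts(mounts_text)
    prefix = "/dev/" + disk  # hypothetical naming: partitions are /dev/hdaN
    on_disk = [point for dev, point in pairs if dev.startswith(prefix)]
    all_points = [point for dev, point in pairs]
    if "/home" in on_disk:
        # /home is its own partition on the failing disk
        return 'The disk "%s" contains /home.' % disk
    if "/" in on_disk and "/home" not in all_points:
        # no separate /home anywhere, so it lives on /
        return "The / partition also contains /home."
    return ""
```

If /home sits on a different disk, the function returns an empty string
and the dialog would simply leave that paragraph out.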
> Keep in mind we're not aiming at the average geek, we're aiming at the
> average newbie.
Yes.
> The geek-type can install gkrellm and get pretty much
> everything you're talking about (minus the security regression tests i
> suppose). The admin-type would install snmpd and get the same through
> MIB database. The average user doesn't care about PAX, network
> throughput etc.
(!)
*hover* "Network health warning"
*click*
Your network appears to be slow. Many broadcast packets are being
received; this may be the cause of the problem. Please inform your
network administrator or someone who understands what this message
means.
[Close] [Advanced] [X] Don't bother me with this anymore
> In fact, not only doesn't he care, most of the time he
> can't grasp the meaning of these words... (and believe me, switching a
> few computer beginners to linux gave me quite a good insight of what
> they care about ;)
>
:)
>
>>Features:
>> - Task tray icon for easy access
>
> gkrellm and the like
>
>> - Warnings when system is unhealthy
>
> gkrellm, snmpd and the like
>
>> - Pluggable monitors for easy expansion and paced development
>
> gkrellm, snmpd and the like
>
>
>>Monitors:
>> + System Configuration Recovery
>> - Check system configuration in /etc at each successful boot and
>> make a backup tarball
>
>
> Pretty pointless. Where would you backup that tarball? If the
> filesystem gets corrupted, it's likely you can't trust any file stored
> on that filesystem. Now remember that we suggest a single partition
> installation...
>
Right, Ubuntu (poorly) suggests a single-partition install rather than a
separate /home. I of course have been using a separate /home for a year
or two now and have reinstalled literally about 30 or 40 times without
having to back up anything or hunt for disks or lose my 30 gigs of music
and settings and CD ISOs and such on /home.
A separate /home is great because in the worst case a user can reinstall,
in most cases with absolutely no detriment. :)
Dump the tarball to a network drive, let the user burn it to CD or
something.
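The backup step itself is not much code. A minimal Python sketch,
assuming the daemon is told where to put the tarball (a network mount, a
staging directory for CD burning, whatever). The function name and the
file naming scheme are my own invention, not an existing tool:

```python
import os
import tarfile
import time


def backup_config(src_dir, dest_dir, now=None):
    """Write a gzip tarball of src_dir into dest_dir and return its path."""
    # Timestamp in the file name so each successful boot gets its own backup
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, "etc-backup-%s.tar.gz" % stamp)
    with tarfile.open(path, "w:gz") as tar:
        # arcname keeps the paths inside the archive relative ("etc/...")
        tar.add(src_dir, arcname="etc")
    return path
```

Something like backup_config("/etc", "/mnt/backup") would have to run as
root to read files such as shadow, and the resulting tarball needs the
same care as the originals.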
>
>> - Allow separate backup tarball of /etc on demand, isolated from
>> automated backups
>> - Provide for restoration of backup tarball during install and at
>> any arbitrary point
>> - Recovery should allow for four types of recovery:
>> - Replace existing files in /etc with matching files from backup
>> - Restore or augment entire /etc structure with backup
>> - Restore individual /etc functions such as init.d scripts, rc.d
>> runlevels, SQUID configuration, profile, authentication (passwd,
>> shadow, group), apt, "Unknown configurations," etc
>> - Restore individual selected files
>
>
> This would only be useful either if the disk got corrupted and the
> backup has been made on a safe media, or if the user is editing the
> config files by himself. In the first case, it's not a background
> task, and could be implemented as a userland utility, and in the
> latter case, the user is smart enough to backup his files before
> editing.
>
> If you fear filesystem corruption, see the remark above.
>
> This looks much to me like "automate my routine admin tasks", more
> than "provide useful information to the user"...
>
It was meant to be "automate my admin tasks" a la MS' "System Restore."
My uncle broke his PC by using System Restore about 60 times in one
month, so I figured since it was so awesome that people had to use it
twice a day, it should be implemented on Linux.
>
>> + S.M.A.R.T. monitoring using smartmontools
>> - Warn when disks are faulty and will fail soon
>
>
> Certainly a good thing. An icon or a message box is enough i think.
>
See above.
>
>> + Memory size
>> - Warn when too much swap is used
>
>
> Back in the early days of MacOS <=9, when you were using too much
> memory, you'd get a message like "your system is running low on
> memory, please consider quitting some running applications". That is
> imho the maximal input you should give the user. An experienced user
> won't need more input since he can get the info by himself, and a
> newbie has just enough data to know: 1) what happens and 2) what to
> do.
>
(!)
*hover* "Memory health warning"
*click*
You are using a lot of swap. This slows your system down
considerably. If you experience noticeable slowdowns, consider buying
more RAM.
A sign that swapping is causing slowdowns is that the hard disk
activity light, a light on your computer case which is normally red
and usually flashing when you load a program, is flickering a lot.
Also, if it takes unusually long to open files or programs, you may
be using too much swap and should consider buying more RAM.
[Close] [Advanced]
>
>> - Subtract off total size of files on tmpfs mounts in calculation
>> - Warn when disk cache drops below X% (possibly 25%) of memory
>> - Suggest more RAM
>
>
> Telling the user he has sucky hardware doesn't look that good to me... :)
>
See above.
Also, by "disk cache low" I meant that cache/buffers and unused physical
RAM together are low, not "Disk cache is 1 meg and there's 1000 megs of
free RAM too."
>
>> - Suggest increasing swappiness if swap is relatively unused
>
>
> What do you mean? Swap is handled by the kernel VM, you don't want an
> average user to dive in and tweak the kernel behaviour, do you? :)
>
(!)
*hover* "Memory health warning"
*click*
Your disk cache is low. Disk cache and buffers hold data from the
hard disk in memory so that your computer is faster. You are getting
this message because less than 25% of your memory is available for
disk cache.
If your computer runs slowly or takes a long time to load programs,
consider adding more RAM. Any unused RAM is used for disk cache by
Linux automatically, making your computer faster. Alternately you
can adjust how much disk cache is favored over program memory, but
this is an advanced option and may slow down your system if abused.
[Close] [Advanced]
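The check behind this dialog could look roughly like the following
Python sketch. It assumes the monitor reads /proc/meminfo and counts
free memory, buffers and cache together as "room for disk cache"; the
function names and the 25% default are only the numbers floated in this
thread, not anything tuned:

```python
def parse_meminfo(text):
    """Turn /proc/meminfo text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def cache_room_fraction(info):
    """Fraction of physical RAM that is free or already used as cache."""
    room = info["MemFree"] + info["Buffers"] + info["Cached"]
    return room / info["MemTotal"]


def disk_cache_low(info, threshold=0.25):
    """True when less than `threshold` of RAM is available for disk cache."""
    return cache_room_fraction(info) < threshold
```

The daemon would feed it open("/proc/meminfo").read() every so often at
idle time and raise the warning only if the condition persists.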
>
>> - Warn when too much total available memory is used
>> - Combine swap and ram for total available memory
>> - Ignore any disk cache over 25% of physical RAM in usage
>> calculation
>
>
> This is bloatsome. No need for such complex mechanism imho.
> Look at how gnome panel "system monitor" applet computes used RAM: it
> doesn't count cache (more precisely it separates it from the output).
> This is enough to know how much free memory is available. If the
> kernel needs to recover cached memory, it can do it.
>
>
>> - Warn when 95% limit reached
>> - Warning message should briefly explain OOM killer:
>> "If memory usage reaches 100%, tasks the OS thinks are unimportant
>> will be terminated automatically. No opportunity to save your
>> work will be given!"
>
>
> This kind of message should go along the one I suggested I think.
>
(!)
*hover* "Memory health EMERGENCY"
*click*
Your computer is using almost all of its available memory!!! If you
reach 100% memory usage, the kernel will automatically close a random
program without allowing you to save your work!!! This MUST be done
or the system will stop responding!!!
Please close some large programs. Below is a list of the top five
programs run by your user which are using a lot of memory. Web
browsers, e-mail clients, and media players make the best first
choices, however. P2P software can be very RAM intensive as well.
If this happens often, you should consider buying more RAM. You may
also increase swap; however, using more swap will result in slowing
the computer down. If you suspect a program is not working properly
and using up excess RAM, try to avoid using that program.
Warning: Terminating the selected tasks will allow them to save
their work first; they may ask you questions and not end until you
respond.
[ ] Gaim
[ ] Thunderbird
[ ] Firefox
[ ] Gnome Terminal
[ ] Totem
[Terminate selected tasks]
[Close] [Advanced]
Above, assume GAIM has a plug-in or such that's leaking memory and now
it eats 600M of RAM. It shows up first now.
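Building that list is cheap. A Python sketch, assuming the monitor reads
/proc/<pid>/status for each process and keeps only the ones owned by the
logged-in user; Proc, parse_status and top_memory_users are names I made
up for the example:

```python
from collections import namedtuple

Proc = namedtuple("Proc", "pid uid name rss_kb")


def parse_status(pid, text):
    """Build a Proc from /proc/<pid>/status text, or None if RSS is absent."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.split()
    # Kernel threads have no VmRSS line; skip anything incomplete
    if not all(fields.get(k) for k in ("Name", "Uid", "VmRSS")):
        return None
    return Proc(pid, int(fields["Uid"][0]), " ".join(fields["Name"]),
                int(fields["VmRSS"][0]))


def top_memory_users(procs, uid, count=5):
    """Largest-RSS processes owned by uid, biggest first."""
    mine = [p for p in procs if p.uid == uid]
    return sorted(mine, key=lambda p: p.rss_kb, reverse=True)[:count]
```

Filtering on the user's UID is what keeps things like X out of the list,
and sorting by RSS is what puts the leaking Gaim at the top.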
>
>> - Suggest terminating tasks with large RSS
>
>
> hints as to which tasks consume more memory might be interesting. You
> would want to filter out all that aren't running with the user's UID
> to avoid suggesting killing "X" for instance :) That could be useful
> if say the first three entries are suggested in the kind message I
> mentioned.
>
:)
>
>> - Suggest more memory
>> - Note that more swap, swapfiles, and swapd may be used; but that
>> these solutions may cause excessive system slowdown
>
>
> Definitely. Not good.
>
>
>> - Warn about particular tasks utilizing a great percentage more
>> memory than when they started (difficult! Tasks all have different
>> needs!)
>> + CPU audit
>> - Audit lengths of high CPU usage bursts
>> - Allow tracking of which programs use large amounts of CPU for
>> extended periods, and how long
>> - When many programs (i.e. not just SETI or gcc, but esp. anything
>> linked to Xlibs, if through GTK+ or Qt or whatnot) cause
>> particularly long CPU spikes, i.e. >10S, suggest faster CPU
>> - Again, this is a difficult task, as some programs should use lots
>> of CPU
>
>
> This is again bloatsome. You will use too much system resource for all
> that computation, and provide the user with complex data he won't be
> able to parse.
>
Yeah, I was thinking of more technical analysis when I wrote that. No
justification.
>
>> + Security
>> - Password testing
>> - Have John hack passwords passively
>> - Warn about weak passwords being cracked
>> - Information about exact user account is privileged information!
>> Only show to root!
>
>
> We don't have root.
Yes you do. You have sudo to get root. Use sudo to read the data.
> Average users like easy password, let them do. A notice about passwd
> best composition when changing password is enough. Remember we have a
> "no open port" default policy.
If the user has sudo access, there may still be a way in. A flaw in
Gaim, X-chat, or Firefox would allow malicious parties to hijack these
programs to enter the system (hence why I advocate PaX/GrSecurity and
SSP to squash lots of these flaws).
> If my GF gets prompted for a new password on a hourly basis because
> she chose my nickname as a passwd, she'll trash the system (or she'll
> trash me), I think ;o)
>
True.
>
>> - PaX tests
>> - PaX test once at boot as root and as normal user
>> - PaX test once every 24 hours again to detect abnormal kernel states
>> - Warn about abnormal failures
>> - Possibly use ProPolice with paxtest to finish off tests PaX does
>> NOT cover (don't warn about those tests, i.e. ret2library)
>
>
> This is only meaningful to some environments. Again, this is intended
> at experienced users who can install and do the right stuff
> themselves. Average user doesn't care/doesn't know about PaX.
> The average user is not running a shell-account server, I think :)
>
PaX is exactly NOT designed to help secure shell account servers. PaX
is designed so that if some idiot wrote a bad decoding routine in libpng
(or maybe SIX OF THEM?), Mozilla won't install malware that sniffs your
passwords out while you sudo around because it was inside a corrupt png
image.
And the PaX and GrSecurity regression tests are only for those
environments, not for general use.
Remember one of the things I want is for PaX, GrSecurity, and SSP to be
integrated with all major distributions. I can send you a nice paper I
wrote which explains why; it's not so hard to justify. It takes a
little work and gives a lot of gains, doesn't do anything to confuse the
user once it's in place, and is easy to maintain once it's in place.
>
>> - Passively scan system at idle time for libraries and executables
>> with relocations
>> - Allow review in the console
>
>
> console... Are we still talking about Mr. Foo? :)
>
Window/console. If it's GUI it's still a console. The console is the
local machine or whatnot.
It was an abuse of terms.
>
>> - Allow specific active scanning of system or individual binaries
>> - Passively scan system at idle time for ELF ET_EXEC executables
>> - Allow review in console
>> - Allow specific active scans
>> - ProPolice tests
>> - Test a ProPolice regression test suite once at boot
>> - Warn about failure
>> - Passively scan system at idle time for libraries and executables
>> without reference to __guard and __stack_smash_handler
>> - Allow review in console
>> - Allow specific active scans
>> - GrSecurity regression tests
>> - Use a GrSecurity regression suite to do tests at each boot
>> - Warn user of abnormal failures
>> - Run tests that need root as root
>> - Run tests that can be tested as user as user AND root
>
>
> This is completely irrelevant to the average setup/user imho.
>
I was thinking in terms of the future.
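For what it's worth, the ET_EXEC scan from the list above is only a
header check. A Python sketch, assuming the scanner just wants to tell
fixed-address executables (e_type ET_EXEC) from position-independent
ones (ET_DYN); the function names are mine, while the offsets and
constants come from the ELF header layout:

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object or position-independent executable


def elf_type(header):
    """Return e_type from the first 18 bytes of a file, or None if not ELF."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return None
    data_encoding = header[5]  # EI_DATA: 1 = little-endian, 2 = big-endian
    if data_encoding == 1:
        fmt = "<H"
    elif data_encoding == 2:
        fmt = ">H"
    else:
        return None
    # e_type is the 16-bit field right after the 16-byte e_ident
    return struct.unpack(fmt, header[16:18])[0]


def is_fixed_position_executable(path):
    """True if the file at path is an ELF ET_EXEC executable."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_EXEC
```

The relocation and __guard scans would need to read the dynamic section
and symbol tables, which is more work than this and not shown here.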
>
>> - Security related updates
>> - Warn when security related software updates are available (how I
>> don't know)
>> - Allow running of update manager
>
>
> We already have an update manager on hoary and it works just fine.
>
Yes, integrate it into a bigger suite.
>
>> - Firewall
>> - Allow remote firewall rule "modules" to be fetched to construct a
>> firewall of stock options (REQUIRE SIGNATURE)
>> - Allow on-site configuration of IP masquerading, routing, port
>> forwarding, and IP connection tracking
>> - Notify when firewall rule modules are updated and ask the user if
>> he wishes to update the firewall
>
>
> Not needed. No open port by default. The user starting to install
> server daemons and opening ports should know what he's doing. At most,
> I can imagine he'd be prompted for the security implications of his
> doing (as Mandrake does when you ask for installing Apache and the
> like)...
Malware could open ports by itself. Do I need to write some and sneak
it on your computer one day, then sniff out your root password by
advanced social engineering?
Let's see:
alias sudo="myevilsudo"
sudo bash
Password: *******
Authentication failed, try again
<myevilsudo drops the alias from your shell>
sudo bash
Password: ********
root#
I pwn ur r00t
>
>
>> + Software management
>> - Update notifier functionality becomes integrated
>> - Easy access to synaptic, or simply integrate Synaptic
>
>
> Already done, as far as I can tell.
>
Yeah, I meant if you're going to have a system health center, integrate
the existing update manager into it. No sense in having this separated.
> This all looks largely irrelevant and overkill to me, for something
> that started out of "making an icon to indicate health", but that's
> just my opinion...
>
I'm one of those people who likes a full, robust suite of everything. I
don't think you should pay a computer technician to tell you you need
more RAM; you should pay him to actually put it in. Your disks
shouldn't fail; they should warn you so you replace them before they
fail. Your security should be working, or you should know it's busted.
> Hope that helps,
>
> T-Bone
>
--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.
Creative brains are a valuable, limited resource. They shouldn't be
wasted on re-inventing the wheel when there are so many fascinating
new problems waiting out there.
-- Eric Steven Raymond