Disk monitoring
Sam Morris
sam at robots.org.uk
Mon Feb 28 12:33:11 CST 2005
Marcus Bauer wrote:
> On Mon, 2005-02-28 at 08:55 -0800, Matt Zimmerman wrote:
>
>
>>The best answer to this is to use SMART to monitor the disks. There is
>>software for this in universe, but it would need some work to have it
>>integrated properly "out of the box".
>>
>>This was on the list of proposed goals for Hoary, but no one expressed
>>interest in working on it. If anyone reading this would like to work on
>>such a project for the next release, contact me.
>
>
> I'm using smartmontools on several machines but still believe that for
> the average user a good backup and his five senses are the best weapon
> against disk failure (i.e. when you realise that the disk makes strange
> noises and programs start slower than usual...)
>
> A sample output from a machine here:
>
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   100   100   032    Pre-fail  Always   -           60197
>   2 Throughput_Performance  0x0005   100   100   020    Pre-fail  Offline  -           57
>   3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always   -           0
>   4 Start_Stop_Count        0x0012   001   001   016    Old_age   Always   FAILING_NOW 65535
>   5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always   -           0
>   7 Seek_Error_Rate         0x000b   100   100   020    Pre-fail  Always   -           280
>   8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline  -           1074003971
>   9 Power_On_Hours          0x0012   027   027   020    Old_age   Always   -           39866799
>  10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always   -           0
>  12 Power_Cycle_Count       0x0012   098   098   020    Old_age   Always   -           342
> 196 Reallocated_Event_Count 0x0033   100   100   024    Pre-fail  Always   -           0
> 198 Offline_Uncorrectable   0x0010   100   100   020    Old_age   Offline  -           0
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always   -           0
> 200 Multi_Zone_Error_Rate   0x000b   100   100   020    Pre-fail  Always   -           1792
> 203 Run_Out_Cancel          0x0002   094   094   020    Old_age   Always   -           408021827589
>
>
> As you can see, Start_Stop_Count is "FAILING_NOW" but the disk has been
> running like a charm for over a year. As long as Spin_Up_Time and
> Seek_Time_Performance are at 100 I don't care at all, as the spindle is
> still in perfect condition.
Start_Stop_Count is an Old_age attribute. When an Old_age attribute
crosses its threshold, it does not mean that the disk is dying; just
that the disk has reached the age at which the manufacturer says it
should be replaced. Take this with a grain of salt if you wish, given
that it's in the disk manufacturer's best interest to make you buy
disks more often. ;)
If any of the Pre-fail attributes drop below their thresholds, that is
the time to panic!
> A second thing to note here is that Power_On_Hours is 39866799, which
> would be about 4,500 years, i.e. the manufacturer is storing seconds in
> that value but smartctl doesn't know about this disk.
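A quick check of that arithmetic, as a sketch only. It assumes, as Marcus
does, that this drive's firmware counts seconds in the raw value; nothing
in the smartctl output itself confirms that.

```python
RAW_POWER_ON = 39866799  # raw value of attribute 9 in the output above

HOURS_PER_YEAR = 24 * 365

# Read as hours, which is what the attribute name suggests.
years_if_hours = RAW_POWER_ON / HOURS_PER_YEAR

# Read as seconds (assumption: this firmware counts seconds, not hours).
hours_if_seconds = RAW_POWER_ON / 3600
years_if_seconds = hours_if_seconds / HOURS_PER_YEAR

print(f"as hours:   {years_if_hours:.0f} years")
print(f"as seconds: {hours_if_seconds:.0f} hours, {years_if_seconds:.2f} years")
# as hours:   4551 years
# as seconds: 11074 hours, 1.26 years
```

About 1.26 years fits a disk that has been running for a little over a
year. If I remember the option correctly, smartctl can be told about this
with "-v 9,seconds", but check the man page before relying on that.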
>
> In my experience with smartmontools, the exception is quite often the
> norm.
>
> I believe that the smartctl output is interesting on an engineering
> level but it is very difficult to automagically produce reliable advice
> for the end user from it.
I've thought about this before. I decided that there's no reason an
end-user tool needs to be much more complicated than displaying one of:
* disk is fine
* disk is too old (Start_Stop_Count)
* disk will fail on 25/5/05
* DISK FAILURE IMMINENT! HEAD FOR THE HILLS! (Seek_Time_Performance)
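To make the list above concrete, here is a minimal sketch of the
classification in Python. It is not what my prototype did; the function
names are made up, it treats VALUE <= THRESH as "failed" (which is how I
read the smartctl man page), and it leaves out the "will fail on
<date>" case because that needs a history of readings, not a single
snapshot.

```python
import re

# One row of the attribute table printed by "smartctl -A".
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>Pre-fail|Old_age)\s+\S+\s+\S+\s+(?P<raw>.*)$"
)


def parse_attributes(text):
    """Return a list of dicts, one per attribute row found in text."""
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line.lstrip("> "))  # tolerate mail quoting
        if m:
            attrs.append({
                "name": m.group("name"),
                "value": int(m.group("value")),
                "thresh": int(m.group("thresh")),
                "type": m.group("type"),
            })
    return attrs


def classify(attrs):
    """Map a snapshot of attributes to one of the end-user messages."""
    # Assumption: an attribute has failed when VALUE <= THRESH.
    failed = [a for a in attrs if a["value"] <= a["thresh"]]
    if any(a["type"] == "Pre-fail" for a in failed):
        return "DISK FAILURE IMMINENT! HEAD FOR THE HILLS!"
    if failed:
        return "disk is too old"
    return "disk is fine"


SAMPLE = """\
4 Start_Stop_Count 0x0012 001 001 016 Old_age Always FAILING_NOW 65535
8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 1074003971
9 Power_On_Hours 0x0012 027 027 020 Old_age Always - 39866799
"""

print(classify(parse_attributes(SAMPLE)))  # disk is too old
```

For Marcus's disk this prints "disk is too old", because the only
attribute at or below its threshold is an Old_age one.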
> IMHO this is only interesting on a server where you additionally monitor
> the values over time and see the tendencies of the values.
I imagined a daemon that is a simplified version of smartd running in
DEVICESCAN mode (in fact my prototype was just a shell script that ran
every couple of hours).
The user-space program would merely look at the output of the daemon and
display a message as above. Since then, dbus has come into being; I now
think I would have the daemon provide a disk monitoring service
(apologies if my terminology isn't correct here), and run one client per
user session.
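As a rough illustration of the daemon/client split, here is a sketch in
Python. It does not use dbus at all: a status file stands in for the
service, and the function names, the file format and the paths are all
made up for the example.

```python
import json
import os
import tempfile


def publish_status(path, statuses):
    """Daemon side: atomically replace the status file.

    statuses maps a device name to one of the end-user messages.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(statuses, f)
    os.chmod(tmp, 0o644)   # let unprivileged clients read it
    os.replace(tmp, path)  # readers see the old file or the new one


def pending_alerts(path):
    """Client side: return one line per disk that is not fine."""
    try:
        with open(path) as f:
            statuses = json.load(f)
    except FileNotFoundError:
        return []  # daemon has not run yet
    return [f"{dev}: {msg}"
            for dev, msg in sorted(statuses.items())
            if msg != "disk is fine"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        status_file = os.path.join(d, "disk-status.json")  # hypothetical
        publish_status(status_file, {"/dev/hda": "disk is too old",
                                     "/dev/hdb": "disk is fine"})
        for line in pending_alerts(status_file):
            print(line)
```

The per-session client would call something like pending_alerts() every
so often and pop up a message for each line it gets back. A real service
would push events to the clients instead of making them poll.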
Since I noticed a few weeks ago that I was missing things like Oops
messages from ndiswrapper while in Gnome, I have also pondered modifying
syslogd to provide a "critical events service", and make the client into
more of a general emergency event notification program.
> Just my two pence.
> Marcus
Regards,
--
Sam Morris
http://robots.org.uk/
PGP key id 5EA01078
Fingerprint 3412 EA18 1277 354B 991B C869 B219 7FDB 5EA0 1078