Disk monitoring

Mon Feb 28 12:33:11 CST 2005

Marcus Bauer wrote:
> On Mon, 2005-02-28 at 08:55 -0800, Matt Zimmerman wrote:
> 
> 
>>The best answer to this is to use SMART to monitor the disks.  There is
>>software for this in universe, but it would need some work to have it
>>integrated properly "out of the box".
>>
>>This was on the list of proposed goals for Hoary, but no one expressed
>>interest in working on it.  If anyone reading this would like to work on
>>such a project for the next release, contact me.
> 
> 
> I'm using the smartmontools on several machines but still believe that
> for the average user a good backup and his five senses are the best
> weapons against disk failure (i.e. when you realise that the disk makes
> strange noises and programs start slower than usual...)
> 
> A sample output from a machine here:
> 
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   100   100   032    Pre-fail  Always       -       60197
>   2 Throughput_Performance  0x0005   100   100   020    Pre-fail  Offline      -       57
>   3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0012   001   001   016    Old_age   Always   FAILING_NOW 65535
>   5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000b   100   100   020    Pre-fail  Always       -       280
>   8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       1074003971
>   9 Power_On_Hours          0x0012   027   027   020    Old_age   Always       -       39866799
>  10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0012   098   098   020    Old_age   Always       -       342
> 196 Reallocated_Event_Count 0x0033   100   100   024    Pre-fail  Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   020    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000b   100   100   020    Pre-fail  Always       -       1792
> 203 Run_Out_Cancel          0x0002   094   094   020    Old_age   Always       -       408021827589
> 
> 
> As you can see, Start_Stop_Count is "FAILING_NOW", but the disk has been
> running like a charm for over a year. As long as Spin_Up_Time and
> Seek_Time_Performance are 100% I don't care at all, as the spindle is
> still in perfect condition.

Start_Stop_Count is an Old_age attribute. Expiration of this attribute 
is not an indication that the disk is dying; just that it is too old 
and should be replaced. Take this with a grain of salt if you wish, 
given that it's in the disk manufacturer's best interest to make you buy 
disks more often. ;)

If any of the Pre-fail attributes drops to or below its threshold, that 
is the time to panic!
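
To make the Pre-fail/Old_age distinction concrete, here is a minimal
sketch that parses rows in the format of the table quoted above and lists
the attributes whose VALUE is at or below THRESH, split by TYPE. The
names ROW, parse_attributes and failed are my own, and the regular
expression assumes the ten-column layout shown above; other smartctl
versions or drives may print something different.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table, assuming the
# ten-column layout shown in the quoted sample output.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>Pre-fail|Old_age)\s+(?P<updated>\S+)\s+"
    r"(?P<when_failed>\S+)\s+(?P<raw>.+?)\s*$"
)


def parse_attributes(text):
    """Return one dict per attribute row found in `text`."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            for key in ("id", "value", "worst", "thresh"):
                row[key] = int(row[key])  # "025" parses as 25
            rows.append(row)
    return rows


def failed(rows, kind):
    """Names of attributes of TYPE `kind` with VALUE <= THRESH."""
    return [r["name"] for r in rows
            if r["type"] == kind and r["value"] <= r["thresh"]]


# Three rows copied from the sample output above.
SAMPLE = """\
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   001   001   016    Old_age   Always   FAILING_NOW 65535
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       1074003971
"""

rows = parse_attributes(SAMPLE)
print(failed(rows, "Pre-fail"))  # []
print(failed(rows, "Old_age"))   # ['Start_Stop_Count']
```

For Marcus's disk this reports nothing under Pre-fail and only
Start_Stop_Count under Old_age, which matches the reading above: old,
but not dying.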

> A second thing to note here is that Power_On_Hours is 39866799, which
> read as hours is about 4,500 years, i.e. the manufacturer is storing
> seconds in that value (about 461 days) but smartctl doesn't know about
> this disk.
> 
> In my experience with the smartmontools, the exception is quite often
> the norm.
> 
> I believe that the smartctl output is interesting on an engineering
> level but it is very difficult to automagically produce reliable advice
> for the end user from it.

I've thought about this before. I decided that there's no reason an 
end-user tool needs to be much more complicated than displaying one of:

  * disk is fine
  * disk is too old (Start_Stop_Count)
  * disk will fail on 25/5/05
  * DISK FAILURE IMMINENT! HEAD FOR THE HILLS! (Seek_Time_Performance)
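
As a rough sketch of how those four messages could be chosen: the
function below takes (name, type, value, threshold) tuples and returns
one line of text. The name summarise and the predicted_failure parameter
are hypothetical. Working out an actual failure date would need a
history of readings, which this sketch does not attempt, so the date is
simply passed in by the caller.

```python
from datetime import date


def summarise(attrs, predicted_failure=None):
    """Pick one of the four end-user messages.

    attrs: iterable of (name, type, value, thresh) tuples.
    predicted_failure: optional date supplied by the caller; how to
    estimate it is outside the scope of this sketch.
    """
    failed = [(name, kind) for name, kind, value, thresh in attrs
              if value <= thresh]
    prefail = [name for name, kind in failed if kind == "Pre-fail"]
    old_age = [name for name, kind in failed if kind == "Old_age"]

    # Most alarming condition first.
    if prefail:
        return ("DISK FAILURE IMMINENT! HEAD FOR THE HILLS! (%s)"
                % ", ".join(prefail))
    if predicted_failure is not None:
        d = predicted_failure
        return f"disk will fail on {d.day}/{d.month}/{d:%y}"
    if old_age:
        return "disk is too old (%s)" % ", ".join(old_age)
    return "disk is fine"


# Two attributes taken from the sample output above.
sample = [
    ("Start_Stop_Count", "Old_age", 1, 16),
    ("Seek_Time_Performance", "Pre-fail", 100, 19),
]
print(summarise(sample))  # disk is too old (Start_Stop_Count)
```

The ordering is a design choice: a failed Pre-fail attribute outranks a
predicted date, which in turn outranks mere old age.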

> IMHO this is only interesting on a server where you additionally monitor
> the values over time and see the tendencies of the values.

I imagined a daemon that is a simplified version of smartd in 
DEVICESCAN mode (in fact my prototype was just a shell script that ran 
every couple of hours).
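
A periodic check of that kind might look roughly like the sketch below,
which runs smartctl on one device and decodes the exit status. It
assumes the exit-status bit meanings documented in the smartctl man page
(bit 3: status check returned DISK FAILING; bit 4: Pre-fail attributes
at or below threshold; bit 5: attributes were at or below threshold in
the past). The names EXIT_BITS, decode_status and check are my own, and
the device path in the usage note is only an example.

```python
import subprocess

# Exit-status bits of interest, as described in the smartctl man page.
EXIT_BITS = {
    3: "SMART status check returned DISK FAILING",
    4: "a Pre-fail attribute is at or below its threshold",
    5: "some attribute was at or below its threshold in the past",
}


def decode_status(returncode):
    """Translate a smartctl exit status into a list of messages."""
    return [message for bit, message in sorted(EXIT_BITS.items())
            if returncode & (1 << bit)]


def check(device):
    """Run smartctl once on `device` and decode its exit status.

    Needs smartctl installed and usually root privileges; a cron job
    or a simple loop could call this every couple of hours.
    """
    result = subprocess.run(
        ["smartctl", "-H", "-A", device],
        capture_output=True, text=True,
    )
    return decode_status(result.returncode)


print(decode_status(0))       # []
print(decode_status(1 << 4))  # one message, about Pre-fail attributes
```

Calling check("/dev/hda") would then give an empty list for a healthy
disk and one or more messages otherwise.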

The user-space program would merely look at the output of the daemon and 
display a message as above. Since then, D-Bus has come into being; I now 
think I would have the daemon provide a disk monitoring service 
(apologies if my terminology isn't correct here), and run one client per 
user session.

Since I noticed a few weeks ago that I was missing things like Oops 
messages from ndiswrapper while in Gnome, I have also pondered modifying 
syslogd to provide a "critical events service", and make the client into 
more of a general emergency event notification program.

> Just my two pence.
> Marcus

Regards,

-- 
Sam Morris
http://robots.org.uk/

PGP key id 5EA01078
Fingerprint 3412 EA18 1277 354B 991B  C869 B219 7FDB 5EA0 1078