Monitoring for disk issues

Marius Gedminas marius at pov.lt
Fri Oct 19 12:19:24 UTC 2012


On Fri, Oct 19, 2012 at 09:43:56AM +0100, Oliver Marshall wrote:
> We have an increasing number of ubuntu based machines kicking about, either
> desktops or basic servers. Most have single disks, or dual disks with a
> software raid.
> 
> We want to monitor them for disk issues after a few of the older ones
> (admittedly very old) died.

sudo apt-get install smartmontools

Then make sure that mail sent to root@localhost is actually delivered
and forwarded somewhere appropriate (e.g. install postfix or ssmtp and
define a root alias in /etc/aliases).
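
A quick way to check the alias part, as a rough sketch only: root_alias
is a made-up helper name, and admin@example.com below is a placeholder
address, not anything from your setup.

```shell
# Hypothetical helper: print where the "root" alias in an aliases file points.
# Defaults to /etc/aliases; pass another path to check a different file.
root_alias() {
    sed -n 's/^root:[[:space:]]*//p' "${1:-/etc/aliases}"
}
```

If that prints nothing, add a line such as "root: admin@example.com" to
/etc/aliases, run "sudo newaliases" (with postfix), and then send
yourself a test message, e.g. with "echo test | mail -s test root",
assuming a mail command is installed.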

> There seems to be a mass of places that we
> might look and script to check but not one place itself.
> 
> I'm told that SCSI errors should appear in /var/log/syslog and that we
> might be able to use smartmon to monitor the smart status of the disks.
> Smart statuses are notoriously unreliable though with disks failing without
> any warning from the smart chips.

After the disk fails, I think you ought to get an email from smartd
about a new entry in the SMART Error Log, or maybe about an increase
in a bad-block attribute such as Reallocated_Sector_Ct (the
"reallocated sector count").
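
If you want to look at that attribute by hand rather than wait for
smartd, something like the following might do. It is only a sketch:
reallocated_sectors is a name I made up, and it assumes the usual ATA
attribute table printed by "smartctl -A", where the raw value is the
tenth column.

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print the
# raw value of the Reallocated_Sector_Ct attribute (column 10).
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}
```

Usage would be "sudo smartctl -A /dev/sda | reallocated_sectors", with
/dev/sda standing in for whatever your disk is called.  "smartctl -H"
shows the overall health verdict and "smartctl -l error" shows the
SMART Error Log itself.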

> In the windows world we have a certain number of event codes in the event
> logs we monitor for. Is there a similar thing we can use here? Monitor for
> a certain string or code in a certain log which all disk errors can be
> expected to use?

Hm.  I don't really know anything about that.  There's logcheck, but
it's based on a blacklist of events you don't want to monitor, rather
than a whitelist of events you want to know about.
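
For a crude whitelist you could grep the kernel log yourself.  A
minimal sketch follows; disk_error_lines is a made-up name, and the
pattern list is my guess at a few common messages, certainly not a
complete catalogue of everything a failing disk can log.

```shell
# Hypothetical helper: print log lines that look like disk trouble.
# Reads the files given as arguments, or stdin if there are none.
# The patterns are illustrative assumptions, not an exhaustive list.
disk_error_lines() {
    grep -E -i 'I/O error|read error|redirecting sector|medium error|unrecovered read' "$@"
}
```

Run from cron, e.g. "disk_error_lines /var/log/kern.log", any output
gets mailed to the cron job's owner, so it reuses the root mail setup
above.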

> Bear in mind we aren't using any 3rd party raid controllers. It's all
> software stuff or single disk.

If you use software RAID, you should get daily emails from mdadm about
failed RAID members.

Although the last time a hard disk failed for me, I received emails
from smartd but nothing from mdadm, despite /var/log/kern.log
containing messages like

    # [6627534.944848] raid1:md2: read error corrected (8 sectors at 5491720 on sda6)
    # [6627534.944864] raid1: sdb6: redirecting sector 5491720 to another mirror
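
Since mdadm may stay silent, as it did for me, it can be worth checking
the array state directly.  A sketch, assuming the usual /proc/mdstat
layout where a degraded array shows a status map like [U_] or [_U];
degraded_arrays is a made-up name.

```shell
# Hypothetical helper: list md arrays whose status map shows a missing
# member, e.g. [U_] or [_U]. Reads /proc/mdstat unless given another file.
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}
```

mdadm sends its mail to the address on the MAILADDR line of
/etc/mdadm/mdadm.conf, and "sudo mdadm --monitor --scan --oneshot --test"
should generate a test alert for each array, which is a reasonable way
to confirm the mail path works before a disk actually dies.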

Marius Gedminas
-- 
"In general, it is safe and legal to kill your children and their children"
POSIX Prg Gt, by Donald Lewine, O'Reilly & Associates, 1991, p.110 (On process
termination)
        -- http://lambda.weblogs.com/discuss/msgReader$7635?mode=day