[ubuntu-uk] RAID failure after a couple of weeks

Mon Sep 8 11:57:01 BST 2008

Hello again,

You may recall my earlier conundrum about using Ubuntu Hardy LTS as a
primary server OS.

Well, I've been running it for a couple of weeks now, and the software
RAID array which I configured has failed (on brand new server hardware,
only 2 weeks old).

I've looked back through the logs and saw this before it happened:

Sep  7 23:41:38 SERVER1 kernel: [904442.098463]          res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  7 23:41:43 SERVER1 kernel: [904447.135776] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep  7 23:41:48 SERVER1 kernel: [904452.113225] ata4: device not ready
(errno=-16), forcing hardreset
Sep  7 23:41:48 SERVER1 kernel: [904452.113229] ata4: soft resetting link
Sep  7 23:42:19 SERVER1 kernel: [904482.307776] ata4.00: qc timeout (cmd 0xec)
Sep  7 23:44:00 SERVER1 kernel: [904482.307782] ata4.00: failed to
IDENTIFY (I/O error, err_mask=0x4)
Sep  7 23:44:00 SERVER1 kernel: [904482.307834] ata4: failed to
recover some devices, retrying in 5 secs
Sep  7 23:44:00 SERVER1 kernel: [904492.352627] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep  7 23:44:00 SERVER1 kernel: [904497.330077] ata4: device not ready
(errno=-16), forcing hardreset
Sep  7 23:44:00 SERVER1 kernel: [904497.330080] ata4: soft resetting link
Sep  7 23:44:00 SERVER1 kernel: [904527.524627] ata4.00: qc timeout (cmd 0xec)
Sep  7 23:44:00 SERVER1 kernel: [904527.524632] ata4.00: failed to
IDENTIFY (I/O error, err_mask=0x4)
Sep  7 23:44:00 SERVER1 kernel: [904527.524683] ata4: failed to
recover some devices, retrying in 5 secs
Sep  7 23:44:00 SERVER1 kernel: [904537.569477] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep  7 23:44:00 SERVER1 kernel: [904542.546929] ata4: device not ready
(errno=-16), forcing hardreset
Sep  7 23:44:00 SERVER1 kernel: [904542.546932] ata4: soft resetting link
Sep  7 23:44:00 SERVER1 kernel: [904572.741478] ata4.00: qc timeout (cmd 0xec)
Sep  7 23:44:00 SERVER1 kernel: [904572.741484] ata4.00: failed to
IDENTIFY (I/O error, err_mask=0x4)
Sep  7 23:44:00 SERVER1 kernel: [904572.741536] ata4.00: disabled
Sep  7 23:44:00 SERVER1 kernel: [904578.288633] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep  7 23:44:00 SERVER1 kernel: [904583.266084] ata4: device not ready
(errno=-16), forcing hardreset
Sep  7 23:44:00 SERVER1 kernel: [904583.266088] ata4: soft resetting link
Sep  7 23:44:00 SERVER1 kernel: [904583.426035] ata4: EH complete
Sep  7 23:44:00 SERVER1 kernel: [904583.426045] sd 3:0:0:0: [sdb]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep  7 23:44:00 SERVER1 kernel: [904583.426052] end_request: I/O
error, dev sdb, sector 468937151
Sep  7 23:44:00 SERVER1 kernel: [904583.426058] md: super_written gets
error=-5, uptodate=0
Sep  7 23:44:00 SERVER1 kernel: [904583.426064] ^IOperation continuing
on 1 devices
Sep  7 23:44:00 SERVER1 kernel: [904583.439176] RAID1 conf printout:
Sep  7 23:44:00 SERVER1 kernel: [904583.439182]  --- wd:1 rd:2
Sep  7 23:44:00 SERVER1 kernel: [904583.439184]  disk 0, wo:0, o:1, dev:sda1
Sep  7 23:44:00 SERVER1 kernel: [904583.439187]  disk 1, wo:1, o:0, dev:sdb1
Sep  7 23:44:00 SERVER1 kernel: [904583.465980] RAID1 conf printout:
Sep  7 23:44:00 SERVER1 kernel: [904583.465984]  --- wd:1 rd:2
Sep  7 23:44:00 SERVER1 kernel: [904583.465985]  disk 0, wo:0, o:1, dev:sda1

Followed by what looks like Ubuntu removing the failed device from the array:

Sep  7 23:44:00 SERVER1 kernel: [904583.439176] RAID1 conf printout:
Sep  7 23:44:00 SERVER1 kernel: [904583.439182]  --- wd:1 rd:2
Sep  7 23:44:00 SERVER1 kernel: [904583.439184]  disk 0, wo:0, o:1, dev:sda1
Sep  7 23:44:00 SERVER1 kernel: [904583.439187]  disk 1, wo:1, o:0, dev:sdb1
Sep  7 23:44:00 SERVER1 kernel: [904583.465980] RAID1 conf printout:
Sep  7 23:44:00 SERVER1 kernel: [904583.465984]  --- wd:1 rd:2
Sep  7 23:44:00 SERVER1 kernel: [904583.465985]  disk 0, wo:0, o:1, dev:sda1

mdadm appears to corroborate this: it has moved the drive in
question (sdb) to failed spare and marked the RAID1 array as
degraded.
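
For anyone wanting to check the same thing on their own box, here is a
rough Python sketch of reading the degraded state out of /proc/mdstat.
It assumes the usual mdstat layout (an "mdN : ..." line listing members,
with "(F)" marking a faulty one, then a status line containing
"[expected/active]"). The function name and the sample text are
hypothetical illustrations, not output from my server.

```python
import re


def degraded_arrays(mdstat_text):
    """Return a dict of degraded md arrays found in /proc/mdstat-style text.

    Assumes each array starts with a line like
        md0 : active raid1 sda1[0] sdb1[1](F)
    followed by a status line containing "[expected/active]", e.g. "[2/1]".
    """
    results = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members flagged (F) have been marked faulty by md.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            continue
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                results[current] = {
                    "failed": failed,
                    "expected": expected,
                    "active": active,
                }
            current = None
    return results


# Illustrative sample only (hypothetical sizes and array name).
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      234468480 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system the text would come from open("/proc/mdstat").read()
rather than the sample string.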

So, before I give the server company a bell: does anyone know if this
definitely looks like an error caused by a failed disk drive, or is it
an Ubuntu bug? (I seem to have hit a few when using Hardy, more than
you'd expect for an LTS, though they are already reported.) smartctl
doesn't return a result for the failed drive, but it does for the
remaining working drive.
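
One way to see whether the errors are confined to a single port and
disk is to tally the kernel messages. Below is a minimal Python sketch
of that idea. It assumes unwrapped syslog lines in the format quoted
above; the function name is made up for illustration, and a tally like
this only shows where the errors cluster, not whether the cause is the
disk, the cable or the controller.

```python
import re
from collections import Counter


def tally_kernel_errors(lines):
    """Count kernel messages per ATA port and I/O errors per block device."""
    ports = Counter()
    devices = Counter()
    for line in lines:
        # Matches "ata4:" as well as "ata4.00:" and records the port name.
        port = re.search(r"\b(ata\d+)(?:\.\d+)?:", line)
        if port:
            ports[port.group(1)] += 1
        # Matches "end_request: I/O error, dev sdb, sector ...".
        dev = re.search(r"I/O error, dev (\w+)", line)
        if dev:
            devices[dev.group(1)] += 1
    return ports, devices


# Three lines taken from the log above, with the email wrapping undone.
SAMPLE_LINES = [
    "Sep  7 23:41:48 SERVER1 kernel: [904452.113229] ata4: soft resetting link",
    "Sep  7 23:42:19 SERVER1 kernel: [904482.307776] ata4.00: qc timeout (cmd 0xec)",
    "Sep  7 23:44:00 SERVER1 kernel: [904583.426052] end_request: I/O error, "
    "dev sdb, sector 468937151",
]

if __name__ == "__main__":
    ports, devices = tally_kernel_errors(SAMPLE_LINES)
    print(dict(ports), dict(devices))
```

To run it over a whole log, pass the open file object instead of the
sample list (the log path varies between setups).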

Arghhhhhhhh !!!!!!

Chris