[ubuntu-uk] RAID failure after a couple of weeks
Chris Rowson
christopherrowson at gmail.com
Mon Sep 8 11:57:01 BST 2008
Hello again,
You may recall my earlier conundrum about using Ubuntu Hardy LTS as a
primary server OS.
Well, I've been running it for a couple of weeks now, and the software
RAID array I configured has failed (on brand new server hardware,
2 weeks old).
I've looked back through the logs and found this from just before it happened:
Sep 7 23:41:38 SERVER1 kernel: [904442.098463] res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 7 23:41:43 SERVER1 kernel: [904447.135776] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep 7 23:41:48 SERVER1 kernel: [904452.113225] ata4: device not ready
(errno=-16), forcing hardreset
Sep 7 23:41:48 SERVER1 kernel: [904452.113229] ata4: soft resetting link
Sep 7 23:42:19 SERVER1 kernel: [904482.307776] ata4.00: qc timeout (cmd 0xec)
Sep 7 23:44:00 SERVER1 kernel: [904482.307782] ata4.00: failed to
IDENTIFY (I/O error, err_mask=0x4)
Sep 7 23:44:00 SERVER1 kernel: [904482.307834] ata4: failed to
recover some devices, retrying in 5 secs
Sep 7 23:44:00 SERVER1 kernel: [904492.352627] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep 7 23:44:00 SERVER1 kernel: [904497.330077] ata4: device not ready
(errno=-16), forcing hardreset
Sep 7 23:44:00 SERVER1 kernel: [904497.330080] ata4: soft resetting link
Sep 7 23:44:00 SERVER1 kernel: [904527.524627] ata4.00: qc timeout (cmd 0xec)
Sep 7 23:44:00 SERVER1 kernel: [904527.524632] ata4.00: failed to
IDENTIFY (I/O error, err_mask=0x4)
Sep 7 23:44:00 SERVER1 kernel: [904527.524683] ata4: failed to
recover some devices, retrying in 5 secs
Sep 7 23:44:00 SERVER1 kernel: [904537.569477] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep 7 23:44:00 SERVER1 kernel: [904542.546929] ata4: device not ready
(errno=-16), forcing hardreset
Sep 7 23:44:00 SERVER1 kernel: [904542.546932] ata4: soft resetting link
Sep 7 23:44:00 SERVER1 kernel: [904572.741478] ata4.00: qc timeout (cmd 0xec)
Sep 7 23:44:00 SERVER1 kernel: [904572.741484] ata4.00: failed to
IDENTIFY (I/O error, err_mask=0x4)
Sep 7 23:44:00 SERVER1 kernel: [904572.741536] ata4.00: disabled
Sep 7 23:44:00 SERVER1 kernel: [904578.288633] ata4: port is slow to
respond, please be patient (Status 0xd0)
Sep 7 23:44:00 SERVER1 kernel: [904583.266084] ata4: device not ready
(errno=-16), forcing hardreset
Sep 7 23:44:00 SERVER1 kernel: [904583.266088] ata4: soft resetting link
Sep 7 23:44:00 SERVER1 kernel: [904583.426035] ata4: EH complete
Sep 7 23:44:00 SERVER1 kernel: [904583.426045] sd 3:0:0:0: [sdb]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 7 23:44:00 SERVER1 kernel: [904583.426052] end_request: I/O
error, dev sdb, sector 468937151
Sep 7 23:44:00 SERVER1 kernel: [904583.426058] md: super_written gets
error=-5, uptodate=0
Sep 7 23:44:00 SERVER1 kernel: [904583.426064] ^IOperation continuing
on 1 devices
Followed by what looks like Ubuntu removing the failed device from the array:
Sep 7 23:44:00 SERVER1 kernel: [904583.439176] RAID1 conf printout:
Sep 7 23:44:00 SERVER1 kernel: [904583.439182] --- wd:1 rd:2
Sep 7 23:44:00 SERVER1 kernel: [904583.439184] disk 0, wo:0, o:1, dev:sda1
Sep 7 23:44:00 SERVER1 kernel: [904583.439187] disk 1, wo:1, o:0, dev:sdb1
Sep 7 23:44:00 SERVER1 kernel: [904583.465980] RAID1 conf printout:
Sep 7 23:44:00 SERVER1 kernel: [904583.465984] --- wd:1 rd:2
Sep 7 23:44:00 SERVER1 kernel: [904583.465985] disk 0, wo:0, o:1, dev:sda1
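For anyone wanting to read that printout mechanically, here is a rough Python sketch that pulls the fields out of the lines above. The field meanings are my assumption from reading the kernel's raid1 driver: wd looks like the count of working disks, rd the configured RAID disks, wo a "not in sync" flag and o an "operational" flag. The function name and dictionary keys are made up for illustration.

```python
import re

# Log lines copied from the kernel output quoted above.
LOG = """\
Sep 7 23:44:00 SERVER1 kernel: [904583.439176] RAID1 conf printout:
Sep 7 23:44:00 SERVER1 kernel: [904583.439182] --- wd:1 rd:2
Sep 7 23:44:00 SERVER1 kernel: [904583.439184] disk 0, wo:0, o:1, dev:sda1
Sep 7 23:44:00 SERVER1 kernel: [904583.439187] disk 1, wo:1, o:0, dev:sdb1
Sep 7 23:44:00 SERVER1 kernel: [904583.465980] RAID1 conf printout:
Sep 7 23:44:00 SERVER1 kernel: [904583.465984] --- wd:1 rd:2
Sep 7 23:44:00 SERVER1 kernel: [904583.465985] disk 0, wo:0, o:1, dev:sda1
"""

SUMMARY_RE = re.compile(r"--- wd:(\d+) rd:(\d+)")
DISK_RE = re.compile(r"disk (\d+), wo:(\d), o:(\d), dev:(\S+)")


def parse_conf_printouts(text):
    """Return one dict per 'RAID1 conf printout:' block found in text."""
    printouts = []
    for line in text.splitlines():
        if "RAID1 conf printout:" in line:
            # A new printout starts; later lines belong to it.
            printouts.append({"wd": None, "rd": None, "disks": []})
            continue
        if not printouts:
            continue  # ignore anything before the first printout
        summary = SUMMARY_RE.search(line)
        if summary:
            printouts[-1]["wd"] = int(summary.group(1))
            printouts[-1]["rd"] = int(summary.group(2))
            continue
        disk = DISK_RE.search(line)
        if disk:
            printouts[-1]["disks"].append({
                "slot": int(disk.group(1)),
                # Assumed meaning: wo=1 means the member is not in sync.
                "not_in_sync": disk.group(2) == "1",
                # Assumed meaning: o=1 means the member is operational.
                "operational": disk.group(3) == "1",
                "dev": disk.group(4),
            })
    return printouts


if __name__ == "__main__":
    for number, printout in enumerate(parse_conf_printouts(LOG), start=1):
        devs = [d["dev"] for d in printout["disks"]]
        print(number, printout["wd"], printout["rd"], devs)
```

Read that way, the first printout still lists sdb1 (flagged as not operational) and the second lists only sda1, which is consistent with the device having been dropped from the array.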
mdadm appears to corroborate the removal: it has moved the drive in
question (sdb) to failed spare and marked the RAID1 array as
degraded.
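The same degraded state can also be checked from /proc/mdstat. Below is a minimal Python sketch that assumes the usual mdstat layout. The sample text, the array name md0 and the block count are hypothetical and not taken from this server. Only the `[2/1] [U_]` status pattern and the `(F)` failed-member marker are relied on.

```python
import re

# Hypothetical /proc/mdstat contents for a two-disk RAID1 with one failed
# member. The array name and block count are illustrative only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      234468352 blocks [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\S+) (\S+) (.*)$")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\(\w\))?")


def degraded_arrays(mdstat_text):
    """Return {array: info} for every array that looks degraded."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            members = MEMBER_RE.findall(array.group(4))
            # Members tagged (F) have been marked faulty by md.
            failed = [name for name, flag in members if flag == "(F)"]
            arrays[current] = {"failed": failed, "degraded": False}
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            # Fewer active members than expected, or a '_' slot, means degraded.
            arrays[current]["degraded"] = (
                active < expected or "_" in status.group(3)
            )
    return {name: info for name, info in arrays.items() if info["degraded"]}


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the text would come from reading /proc/mdstat instead of the sample string.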
So, before I give the server company a bell: does anyone know whether
this definitely looks like an error caused by a failed disk drive, or
is it an Ubuntu bug? (I seem to have hit a few already-reported bugs
when using Hardy, more than you'd expect for an LTS.) smartctl
doesn't return a result for the failed drive, but it does for the
remaining working drive.
Arghhhhhhhh !!!!!!
Chris