RAID drop-out ERC/TLER was: DANGER!!! Problems with 10.04 installer (RAID devices will get corrupted)

Tue Apr 27 18:27:31 UTC 2010

Dave Howorth wrote:
> CLIFFORD ILKAY wrote:
>>> On 04/23/2010 12:17 PM, CLIFFORD ILKAY wrote:
>>>> The issue is documented here
>>>> <http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery>   and
>>>> elsewhere. Western Digital isn't the only manufacturer with this issue
>>>> (and a solution).
> ...
>> However, TLER *might* have something to do with "we'll separate the 
>> drives into consumer and pro lines and charge more for the pro lines 
>> because we can". Most consumers don't care about this issue and are 
>> unaffected by it. Those who do care about it grumble, pay more, and move on.
> 
> Clifford,
> 
> Thanks very much for that link. You pointed me to a real issue that's
> very relevant since I'm just building a new machine. So I've been doing
> some reading and here's a summary in case it helps someone.
> 
> For anybody that hasn't followed the link, the issue is that RAIDs can
> sometimes suffer drive drop-outs because the drive's error recovery
> efforts take longer than the RAID controller allows. The RAID controller
> then fails the drive.
> 
> There is a feature in the ATA-8 interface, in the SMART Command
> Transport (SCT), that allows the drive to be set up to abort recovery
> attempts sooner, so the RAID controller can have a go. This is called
> Error Recovery Control (ERC).
> 
> Most manufacturers implement this capability. WD did, though calling it
> TLER, but apparently they've now removed the feature so their 'consumer'
> drives are problematic in RAID arrays. So much for the I in RAID :(
> 
> The version of smartctl in SVN is able to issue the ERC commands but it
> won't be formally released until V5.40. See
> <http://www.csc.liv.ac.uk/~greg/projects/erc/> for details.
> 
> By pure luck, I think I'm OK. I've bought some WD RE4 drives, which
> apparently have ERC enabled and I bought some Seagate 7200.11, which can
> have ERC enabled via smartctl.
> 
> Cheers, Dave
> 

There's a flip side to consider.

If the drive needs to spend so much time on attempted data recovery,
that drive probably needs attention in any case.  While it can be good
for long uptime to let the raid controller fix the errors on the fly
from the parity/redundant data, I don't consider it a bad thing at all
that a flaky drive gets dropped and demands immediate replacement or a
low-level format/test.  Of course, the needs of enterprise raid
controllers are a bit different.  They would want the raid controller
to fix the error and report via e-mail notification that a drive is
going flaky, without dropping the drive (which can potentially lead to
data loss if any other drive in the array also has unreadable sectors
and the raid array breaks).

My point, however, is that it's an oversimplification to say that these
drives are unsuitable for use in raid arrays.  They work perfectly fine
as is, even if they are not the best choice for high availability
environments.
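
To make the timing argument concrete, here is a rough Python sketch of
the drop-out condition and of the smartctl call that sets the ERC
limits.  The function names and the 8 second controller timeout are my
own placeholders, not figures from any particular controller, and
/dev/sdX stands in for a real device.  The scterc option needs a
smartctl build that has the ERC support (5.40 or the SVN version
mentioned above), and it takes its limits in units of 100 ms.

```python
def drive_is_dropped(recovery_seconds, controller_timeout_seconds):
    """Return True if the controller gives up before the drive finishes.

    Simplified model: the controller fails the drive as soon as the
    drive's internal error recovery outlasts the controller's timeout.
    """
    return recovery_seconds > controller_timeout_seconds


def scterc_command(device, read_seconds=7.0, write_seconds=7.0):
    """Build the smartctl argument list that sets the ERC limits.

    smartctl expects the read and write limits in units of 100 ms,
    so 7.0 seconds becomes 70.
    """
    read_limit = round(read_seconds * 10)
    write_limit = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_limit},{write_limit}", device]


if __name__ == "__main__":
    controller_timeout = 8.0  # hypothetical value, check your controller

    # A drive that retries for a minute outlasts the controller.
    print(drive_is_dropped(60.0, controller_timeout))

    # With ERC capped at 7 seconds the drive reports the error in time.
    print(drive_is_dropped(7.0, controller_timeout))

    # The command is only printed here, not executed.
    print(" ".join(scterc_command("/dev/sdX")))
```

Whether the setting survives a power cycle depends on the drive, so it
may have to be reapplied at boot.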
