[Bug 64548] Re: System logs fill with cdrom errors (1 GB per 20 minutes)

Wed Mar 7 12:54:55 UTC 2007

Thanks for looking at this. I'm sorry this turned into a long
response; I hope you'll bear with me.

No, I haven't tried changing the cable. I have detached and re-attached
it since (not immediately, but days or weeks after the incident), but
that may not have had any effect on this bug. I have two drives
attached to the cable; the one affected by this bug was the slave. As I
said, it was most probably caused by a faulty CD or by something that
went wrong with the drive as a disc was ejected. (I really don't know
what a "DriveReady SeekComplete Error" denotes. This is all guesswork,
but some process may have tried to access the disc, not recognizing it
had been ejected... or something else.)

The drive has - before and after this incident - otherwise worked well,
without any problems whatsoever.

As this was a one-time incident, even if it was the hardware's fault, it
would be rather hard to actually reproduce. Nevertheless, I tried for a
while to provoke both of my drives by reading faulty CDs and ejecting
discs in mid-operation, but didn't get the same effect as here. (The
most I got was a single "cdrom: dropping to single frame dma" and
"cdrom: This disc doesn't have any tracks I recognize!" around the time
of a manual eject while cdparanoia was doing its job but stuck on some
scratch.)

I'll recount, for the record, a few incidents I faced that share some common characteristics:
1) As mentioned, at one point I inadvertently kicked the machine, at which point it started flooding all ttys with I/O errors, forcing a reboot.
2) Some time after that, I ran into this bug, which somewhat reminded me of that incident.
3) Later still, I ran into what I've reported in bug 64914 - very much reminiscent of incident number 1, but this time provoked by an 'e2fsck' on a hard disk drive. (Apparently this was the damage caused by incident 1. See the referenced report for details.)
4) After that incident, not wishing to risk my data on a potentially damaged disk, I got a new disk, did a fresh install of Dapper on that one, and attached my old disk as a slave (hdb) to the new one. (After I wrote data to fill up the damaged partition, the drive - as I had hoped - noticed and repaired the fault, assigning backup sectors to replace the faulty ones.)

As I've switched to Dapper, I'm using a different kernel at the moment,
and as both of these were one-time incidents, it would be hard to
reproduce them. (The other one was reproducible, but I didn't want to
leave the disk in its damaged condition and not try to fix it.)

So, what do I think is the "issue" here? The issue, in my opinion, is this:
- Hardware may fail without prior notice,
- but could the OS handle such failures more gracefully?

_If_ I understand correctly, this _is_ problematic for the OS, because
if a hardware component "complains" about something, it may block the
operating system / program being used from responding to the situation.
If the hardware keeps malfunctioning and the OS / program can't react,
this is an uncomfortable situation for the user.

In the case reported here, I could and did use the machine until and
after the root partition filled up, and was able to reboot nicely (which
fixed the drive ;-) and repair the problem (i.e. truncate the logs and
free up space). In the other case I reported, I couldn't reboot nicely
(though I am not sure if I tried the magic SysRq combinations).

So, back to this report:
- The drive was giving out error signals.
- This wasn't preventing the machine from functioning otherwise normally.
- The kernel dutifully logged the error messages, but nothing caught the fact that the same message was repeating at a rate of 1 GB per hour per log, times three logfiles (3 GB per hour in total, hence the 1 GB per 20 minutes in the title).

I'm not sure what can be done, but - as a layman - it would at least
seem possible (for 'syslogd' and 'klogd') to catch this kind of
situation, count the errors, and make a note somewhere that "The
previous error was repeated 186000 times during the time from DDMMYYYY
hh:mm:ss to DDMMYYYY hh:mm:ss." Taking this further, the user could,
based on this data, also be alerted to the fact that something is amiss
in a nicer way than finding out the logs filled the root partition.
(This could, at its simplest, be done by displaying the "note" on all
ttys.)
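
To make the counting idea a bit more concrete, here is a rough Python
sketch of it. This is only an illustration and not how 'syslogd' or
'klogd' are actually implemented: the names collapse_repeats and
repeat_note, and the (timestamp, message) record format, are made up
for this example. It writes each distinct message once and, when a run
of identical messages ends, emits a single note with the repeat count
and the time span.

```python
from datetime import datetime
from typing import Iterable, Iterator, Tuple

# DDMMYYYY hh:mm:ss, the timestamp layout used in the note suggested above.
STAMP_FORMAT = "%d%m%Y %H:%M:%S"


def repeat_note(count: int, first: datetime, last: datetime) -> str:
    """Build the one-line note that stands in for a run of repeats."""
    return (
        "The previous error was repeated %d times during the time "
        "from %s to %s."
        % (count, first.strftime(STAMP_FORMAT), last.strftime(STAMP_FORMAT))
    )


def collapse_repeats(records: Iterable[Tuple[datetime, str]]) -> Iterator[str]:
    """Yield each message once; replace consecutive repeats with a note.

    Hypothetical helper: `records` is assumed to be an iterable of
    (timestamp, message) pairs in the order they were logged.
    """
    last_msg = None
    repeats = 0
    first_repeat = last_repeat = None

    for stamp, msg in records:
        if msg == last_msg:
            # Same message again: count it instead of writing it out.
            if repeats == 0:
                first_repeat = stamp
            last_repeat = stamp
            repeats += 1
            continue
        if repeats:
            # The run has ended; summarize it before the new message.
            yield repeat_note(repeats, first_repeat, last_repeat)
        yield "%s %s" % (stamp.strftime(STAMP_FORMAT), msg)
        last_msg, repeats = msg, 0

    if repeats:
        # Input ended in the middle of a run of repeats.
        yield repeat_note(repeats, first_repeat, last_repeat)
```

With something along these lines, 186000 identical error lines would
take up two lines in the log instead of filling the partition. The
repeat count could also serve as the trigger for alerting the user
once it passes some threshold.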

-- 
System logs fill with cdrom errors (1 GB per 20 minutes)
https://launchpad.net/bugs/64548