Minutes from the Technical Board, 2008-07-15

Andrew Sayers andrew-ubuntu-devel at pileofstuff.org
Tue Aug 19 21:32:45 UTC 2008


I think there's an elephant in this room - why are we running fsck at all?

a) If it's to detect corruption due to software errors, fsck should be
linked up to apport so that failures are reported (semi-)automatically.
b) If it's to check for dying hardware[1], it can be disabled for all
but the oldest hard drives[2], and even then it is better replaced by a
badblocks check that runs while booting continues (see the sketch just
after this list).
c) If it's to guard against bit-flipping caused by cosmic rays and other
weirdness[3], the snapshot-based solutions discussed earlier would be
more appropriate, because the most vulnerable drives are the huge,
highly active ones that live on servers which never get rebooted.
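
For instance, a background surface scan that doesn't block the boot
might look something like the following (a rough sketch only; the
device name and log path are just placeholders):

	# Read-only surface scan, started in the background so boot carries on.
	badblocks -s /dev/sda > /var/log/badblocks-sda.log 2>&1 &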

The nearest thing to a definitive statement that I've been able to find
is in the tune2fs man page.  The following is from the text for the "-c"
option:

	Bad disk drives, cables, memory, and kernel bugs could all
	corrupt a filesystem without marking the filesystem dirty or
	in error.

(A similar message is included in the text of the "-i" option.)

This seems to cover all the above alternatives.  Given that any solution
that wants to make it into Intrepid has to be feature-complete by the
28th, how about doing 'fsck ... | tee /var/tmp/fsck.log || mv
/var/tmp/fsck.log /var/cache/apport.log' in checkfs, then getting apport
to pick up any logs and ask to report them in the usual way?  Then we'll
have better data on which to base a decision for Intrepid+1.
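
To make that a bit more concrete, here's a rough sketch of what the
checkfs hook might look like.  The paths, the fsck options and the
apport hand-off are all assumptions, not the real /etc/init.d/checkfs.sh,
and since piping straight through tee would lose fsck's exit status, the
output is captured to a file first and echoed afterwards:

	#!/bin/sh
	# Sketch only -- paths and the apport hand-off are placeholders.
	LOG=/var/tmp/fsck.log

	# Run the usual boot-time check, keeping a copy of its output.
	# Redirecting to a file first preserves fsck's exit status.
	fsck -A -R -p > "$LOG" 2>&1
	FSCK_STATUS=$?
	cat "$LOG"

	# fsck's exit status is a bit-mask: 0 = clean, 1 = errors corrected,
	# 2 or higher = something apport might want to hear about.
	if [ "$FSCK_STATUS" -ge 2 ]; then
	    # Stash the log somewhere apport could be taught to look.
	    mv "$LOG" /var/cache/apport.log
	else
	    rm -f "$LOG"
	fi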

	- Andrew

[1]https://lists.ubuntu.com/archives/ubuntu-devel-discuss/2007-October/001843.html
[2]https://lists.ubuntu.com/archives/ubuntu-devel-discuss/2007-October/001856.html
[3]http://kerneltrap.org/Linux/Data_Errors_During_Drive_Communication



