Detecting disk failures on XFS
Alexander Hartner
thahartner at gmail.com
Tue Nov 8 06:15:26 UTC 2022
We have been dealing with a problem where an NVMe drive fails every so
often, more often than it really should. While we try to make sense of the
hardware issue, we are also looking at recovery options.
Currently we are running Ubuntu 20.04 LTS on XFS with a single NVMe disk.
When the disk fails, the following errors are reported:
    Nov 6, 2022 @ 20:27:12.000 [1095930.104279] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    Nov 6, 2022 @ 20:27:12.000 [1095930.451711] nvme nvme0: 64/0/0 default/read/poll queues
    Nov 6, 2022 @ 20:27:12.000 [1095930.453846] blk_update_request: I/O error, dev nvme0n1, sector 34503744 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
And the system becomes completely unresponsive.
I am looking for a way to stop the system when this happens, so that the
other nodes in our cluster can take over the work. However, since the
system is unresponsive and the disk is presumably in read-only mode, we are
stuck in a sort of zombie state where the processes are still running but
have no access to the disk. On ext3/4 there is a mount option to take the
system down:
    errors={continue|remount-ro|panic}
        Define the behavior when an error is encountered. (Either ignore
        errors and just mark the filesystem erroneous and continue, or
        remount the filesystem read-only, or panic and halt the system.)
        The default is set in the filesystem superblock, and can be
        changed using tune2fs(8).
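For illustration, on ext4 that behaviour would be selected either per mount
or persistently in the superblock; the device and mount point below are
just placeholders for our setup:

    # /etc/fstab entry (placeholder device and mount point)
    /dev/nvme0n1p1  /data  ext4  defaults,errors=panic  0  2

    # or set it in the superblock so it applies regardless of fstab
    tune2fs -e panic /dev/nvme0n1p1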
Is there an equivalent for XFS? I didn't find anything similar in the XFS
man page.
Also, are there any other suggestions to handle this better?
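As a stopgap I have been sketching an external watchdog along the following
lines (untested, and the probe and marker paths are just placeholders): it
periodically writes to the affected filesystem and forces a reboot via
sysrq if the write fails or hangs, so another node can take over.

    #!/bin/sh
    # Untested sketch: probe the XFS mount with a small write and force
    # an immediate reboot if the write fails or does not complete.
    # Must run as root; requires kernel.sysrq to permit the 'b' trigger.
    PROBE=/data/.fs-probe   # placeholder: a path on the NVMe-backed mount
    OK=/run/fs-probe.ok     # marker on tmpfs, unaffected by the bad disk

    while true; do
        rm -f "$OK"
        # run the probe in the background so a write stuck in D state
        # cannot block the watchdog loop itself
        ( date > "$PROBE" && sync "$PROBE" && touch "$OK" ) &
        sleep 15
        # marker missing => the probe write failed or is still hanging
        if [ ! -f "$OK" ]; then
            # 'b' reboots immediately without syncing or unmounting
            echo b > /proc/sysrq-trigger
        fi
        sleep 15
    done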