LVM detecting failed disks
Aaron Rainbolt
arraybolt3 at gmail.com
Thu Nov 10 14:38:52 UTC 2022
On Thu, Nov 10, 2022 at 1:14 AM Alexander Hartner <thahartner at gmail.com> wrote:
>
> We had a number of failed disks on a project which resulted in the affected system entering a zombie state, where the disk was no longer accessible and the host non responsive.
>
> While investigating the issue we simulated disk failure by detaching the disk on virtualbox. Using this methodology we evaluated several different combinations
>
> MDADM RAID + LVM + XFS
> LVM + ZFS
> ZFS
> LVM + EXT4
> EXT4
>
> We also tried to set the following kernel / disk options to trigger a panic and automatically reboot the system
>
> XFS:
> sysctl -w fs.xfs.panic_mask=256
> sysctl -w kernel.panic_on_io_nmi=1
> sysctl -w kernel.panic=3
> <Disk disconnect>
>
> EXT4:
> tune2fs -e panic /dev/mapper/ubuntu--vg-ubuntu--lv (or)
> tune2fs -e panic /dev/sda2
> sysctl -w kernel.panic=3
> <Disk disconnect>
>
> Findings
> The results of our testing were that when using EXT4 we were able to trigger a panic and reboot on some of the cases. However with XFS the above options didn't work.
>
> Also as soon as we added LVM to the setup, any errors were lost and the system just continued attempting to access the disk and failed.
>
> Is there an option in LVM to pass any hardware error detected on the underlying disks to the file system and trigger a panic / reboot ?
I'm going to suggest an entirely different solution to this problem
that will hopefully still do what you want.
The Linux kernel contains a feature known as "Magic SysRq". It's a set
of commands that can be given directly to the kernel to cause various
things to happen, even if most of the system is locked up. Assuming
the kernel is still responsive, it will obey Magic SysRq commands no
matter what else is happening. One of the Magic SysRq commands is
designed to cause a kernel panic.
The usual way of executing Magic SysRq commands is by using the
keyboard, however you can also trigger various Magic SysRq features by
writing to the /proc/sysrq-trigger file. According to
https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html, the way
to trigger a kernel panic using /proc/sysrq-trigger is by running
"echo c > /proc/sysrq-trigger" as root.
If this command does nothing, you may have to enable all Magic SysRq
features first by using "echo 1 > /proc/sys/kernel/sysrq".
With this info, you might be able to create a script that periodically
checks the kernel dmesg log for I/O errors on the flaky disk. If it
detect one, it will immediately execute "echo 1 >
/proc/sys/kernel/sysrq && echo c > /proc/sysrq-trigger". This will
cause a kernel panic when the disk goes wonky.
(Note: I am not a Canonical employee. I am only a volunteer, and while
the info I am providing is, to the best of my knowledge, correct and
up-to-date, I cannot and do not make any guarantees to its accuracy.
Any decisions you make based on this data are made at your own risk.
If you need professional support, it is recommended that you look into
Ubuntu Pro.)
--
Aaron Rainbolt
Lubuntu and Ubuntu Member
https://github.com/ArrayBolt3
https://launchpad.net/~arraybolt3
@arraybolt3:matrix.org on Matrix, arraybolt3 on irc.libera.chat
More information about the ubuntu-users
mailing list