[Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs
Krister Johansen
2036467 at bugs.launchpad.net
Fri Jan 12 06:08:04 UTC 2024
Hi Matthew,
Thanks for the update. I went ahead and tested your updated packages on Focal, Jammy, and Noble images in EC2 this evening. With the latest packages installed, I was unable to reproduce the problem on any of the three installs. I'm not sure which builds were inconsistent about triggering the problem for you, but it may be worth noting that the package versions after Focal picked up an additional partial fix for the superblock checksum mismatch. In those versions, it retries the read of the block up to 3 times before returning a failure. In my previous testing, this increased the amount of time before one hits the problem, but did not eliminate it entirely.
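For illustration only, the retry-up-to-3-times behavior can be sketched in shell roughly as follows. This is not the e2fsprogs code; the `retry` helper name and its arguments are made up for this example, and the real logic lives inside the library's superblock read path.

```shell
#!/usr/bin/bash
# Hypothetical helper (not part of e2fsprogs): run a command up to
# max_tries times and stop at the first success.
retry() {
    local max_tries=$1
    shift
    local attempt
    for ((attempt = 1; attempt <= max_tries; attempt++)); do
        # Success on any attempt ends the loop early.
        if "$@"; then
            return 0
        fi
    done
    # Every attempt failed, so report failure to the caller.
    return 1
}
```

Something like `retry 3 dumpe2fs -h /dev/nvme1n1p1 >/dev/null` would then only fail if all three reads failed, which is why a retry narrows the race window without closing it.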
Thanks again for your help with getting these patches in. It's much
appreciated!
-K
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467
Title:
Resizing cloud-images occasionally fails due to superblock checksum
mismatch in resize2fs
Status in cloud-images:
New
Status in e2fsprogs package in Ubuntu:
In Progress
Status in e2fsprogs source package in Trusty:
Won't Fix
Status in e2fsprogs source package in Xenial:
Won't Fix
Status in e2fsprogs source package in Bionic:
Won't Fix
Status in e2fsprogs source package in Focal:
In Progress
Status in e2fsprogs source package in Jammy:
In Progress
Status in e2fsprogs source package in Lunar:
In Progress
Status in e2fsprogs source package in Mantic:
In Progress
Bug description:
[Impact]
This is a long-running bug plaguing cloud-images, where on rare
occasions resize2fs fails and the image is not resized to fit
the entire disk.
Online resizes would fail due to a superblock checksum mismatch, where
the superblock in memory differs from what is currently on disk due to
changes made to the image.
$ resize2fs /dev/nvme1n1p1
resize2fs 1.47.0 (5-Feb-2023)
resize2fs: Superblock checksum does not match superblock while trying to open /dev/nvme1n1p1
Couldn't find valid filesystem superblock.
Changing the read of the superblock to Direct I/O solves the issue.
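As a rough way to see the difference between a buffered read and a Direct I/O read of the superblock from the command line, here is a minimal sketch. It is not how resize2fs implements the fix; the `read_sb_magic` helper and its `mode` argument are hypothetical names for this example. It relies only on the ext2/3/4 on-disk layout: the primary superblock starts at byte 1024, and the 2-byte little-endian magic 0xEF53 sits at offset 0x38 within it.

```shell
#!/usr/bin/bash
# Hypothetical helper: print the superblock magic of a device or image
# as hex bytes in on-disk order ("53ef" for ext2/3/4).
# mode "direct" reads with O_DIRECT (dd iflag=direct), bypassing the
# page cache; mode "buffered" reads through the page cache.
read_sb_magic() {
    local dev=$1
    local mode=${2:-direct}
    local flags=()
    if [ "$mode" = direct ]; then
        flags=(iflag=direct)
    fi
    # Read one aligned 4096-byte block, since O_DIRECT needs aligned I/O,
    # then pick out the 2 magic bytes at byte 1080 (1024 + 0x38).
    dd if="$dev" bs=4096 count=1 "${flags[@]}" 2>/dev/null |
        od -A n -t x1 -j 1080 -N 2 | tr -d ' \n'
}
```

For example, `read_sb_magic /dev/nvme1n1p1 direct` should print `53ef` on an ext4 partition. The magic itself does not change during a resize, so this only shows the two read paths; it does not detect the checksum race.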
[Testcase]
Start a c5.large instance on AWS, and attach a 60 GB gp3 volume for
use as a scratch disk.
Run the following script, courtesy of Krister Johansen and his team:
#!/usr/bin/bash
set -euxo pipefail
while true
do
    parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
    sleep .5
    mkfs.ext4 /dev/nvme1n1p1
    mount -t ext4 /dev/nvme1n1p1 /mnt
    stress-ng --temp-path /mnt -D 4 &
    STRESS_PID=$!
    sleep 1
    growpart /dev/nvme1n1 1
    resize2fs /dev/nvme1n1p1
    kill $STRESS_PID
    wait $STRESS_PID
    umount /mnt
    wipefs -a /dev/nvme1n1p1
    wipefs -a /dev/nvme1n1
done
Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test
If you install the test packages, the race no longer occurs.
[Where problems could occur]
We are changing how resize2fs reads the superblock from underlying
disks.
If a regression were to occur, resize2fs could fail to resize offline
or online volumes. As all cloud-images are resized online during their
initial boot, a regression could have a large impact on public and
private clouds.
[Other info]
Upstream mailing list discussion:
https://lore.kernel.org/linux-ext4/20230605225221.GA5737@templeofstupid.com/
https://lore.kernel.org/linux-ext4/20230609042239.GA1436857@mit.edu/
This was fixed in the below commit upstream:
commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
Author: Theodore Ts'o <tytso at mit.edu>
Date: Thu, 15 Jun 2023 00:17:01 -0400
Subject: resize2fs: use Direct I/O when reading the superblock for
online resizes
Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
The commit has not yet been included in any tagged release. All
supported Ubuntu releases require this fix, and it needs to be published
in the standard non-ESM archives to be picked up in cloud images.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions