[Bug 2045586] Re: livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable in noble
dann frazier
2045586 at bugs.launchpad.net
Mon Jan 22 20:10:23 UTC 2024
I ran into this on jammy/amd64:
https://autopkgtest.ubuntu.com/results/autopkgtest-
jammy/jammy/amd64/l/livecd-rootfs/20240121_173406_e4f9a@/log.gz
I downloaded all of the amd64 failures and searched for this failure
pattern. These were the kernels that were running at the time:
"Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023"
"Linux 6.2.0-21-generic #21-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 14 12:34:02 UTC 2023"
"Linux 6.3.0-7-generic #7-Ubuntu SMP PREEMPT_DYNAMIC Thu Jun 8 16:02:30 UTC 2023"
"Linux 6.5.0-9-generic #9-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 7 01:35:40 UTC 2023"
"Linux 6.6.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 30 10:27:29 UTC 2023"
Here's the count of failures per image type:
12 017-disk-image-uefi.binary
3 018-disk-image.binary
3 020-kvm-image.binary
1 023-vagrant.binary
1 024-vagrant.binary
I can confirm that /dev/loop0p1 is created by devtmpfs. This surprised
me because I'd never actually need to know what devtmpfs was, and I saw
devices being created even though I had SIGSTOP'd systemd-udevd. But
watching udevadm monitor and forktrace output convinced me.
I had a theory that something was opening the first created partition before all partitions were created. loop_reread_partitions() can fail without returning an error to userspace:
https://elixir.bootlin.com/linux/v5.15.147/source/drivers/block/loop.c#L676
that could happen if bdev_disk_changed() aborts because it finds another partition on the device is open:
https://elixir.bootlin.com/linux/v5.15.147/source/block/partitions/core.c#L662
But then we should see this in dmesg:
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n"
I added dmesg calls to check that:
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-dannf-test/jammy/amd64/l/livecd-rootfs/20240122_161631_62ecd@/log.gz
.. but no such message appeared, so that's not it. But what *is*
interesting there is that it shows *2* partition scan lines:
1248s [ 990.855361] loop0: detected capacity change from 0 to 4612096
1248s [ 990.855628] loop0: p1 p14 p15
1248s [ 990.874241] loop0: p1 p14 p15
Previously we just saw 1:
1189s [ 932.268459] loop0: detected capacity change from 0 to 4612096
1189s [ 932.268715] loop0: p1 p14 p15
That only gets printed when bdev_disk_changed() is called. So do we have
2 racing callers?
One thing that seems off is that loop_configure() unsuppresses uevents
for the full device before the partition scan, but loop_change_fd()
waits until the partition scan is complete. Shouldn't they be following
the same pattern? I wonder if that could cause the following race:
[livecd-rootfs] losetup creates /dev/loop0
[livecd-rootfs] kernel sends uevent for /dev/loop0
[livecd-rootfs] /dev/loop0p* appear in devtmpfs
[udev] receives uevent for loop0
[udev] partprobe /dev/loop0
[livecd-rootfs] losetup exit(0)
[partprobe] /dev/loop0p* cleared
[livecd-rootfs] check for /dev/loop0p1 FAILS
[partprobe] /dev/loop0p* recreated
I tried checking for this using ftrace in a local jammy VM. I haven't been able to reproduce this in a local VM, but I wanted to see what happens in a normal losetup.. er... setup.
>>>>> First I used losetup to create the device:
root at dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# loopdev="$(losetup --show -f -P -v /home/ubuntu/disk.img)"
root at dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 1/1 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
losetup-1996 [000] ..... 657.573994: bdev_disk_changed <-loop_reread_partitions
>>>>> Only the expected bdev_disk_change() call
>>>>> Then I remove the device:
root at dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# losetup -v -d $loopdev
root at dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 3/3 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
losetup-1996 [000] ..... 657.573994: bdev_disk_changed <-loop_reread_partitions
systemd-udevd-2524 [000] ..... 680.555336: bdev_disk_changed <-blkdev_get_whole
losetup-2523 [000] ..... 680.568473: bdev_disk_changed <-__loop_clr_fd
>>>>> udev did rescan the partitions... but only during removal. And its
strange that it gets in there before losetup calls it itself.
Perhaps I should add this tracing into a debug livecd-rootfs and see if
I can reproduce in our autopkgtest infra.
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to util-linux in Ubuntu.
https://bugs.launchpad.net/bugs/2045586
Title:
livecd-rootfs uses losetup -P for theoretically reliable/synchronous
partition setup but it's not reliable in noble
Status in linux package in Ubuntu:
New
Status in livecd-rootfs package in Ubuntu:
New
Status in util-linux package in Ubuntu:
New
Bug description:
In mantic, we migrated livecd-rootfs to use losetup -P instead of
kpartx, with the expectation that this would give us a reliable, race-
free way of loop-mounting partitions from a disk image during image
build.
In noble, we are finding that it is no longer reliable, and in fact
fails rather often.
It is most noticeable with riscv64 builds, which is the architecture
where we most frequently ran into problems before with kpartx. The
first riscv64+generic build in noble where the expected loop partition
device is not available is
https://launchpad.net/~ubuntu-
cdimage/+livefs/ubuntu/noble/cpc/+build/531790
The failure is however not unique to riscv64, and the autopkgtest for
the latest version of livecd-rootfs (24.04.7) - an update that
specifically tries to add more debugging code for this scenario - has
also failed on ppc64el.
https://autopkgtest.ubuntu.com/packages/l/livecd-
rootfs/noble/ppc64el
The first failure happened on November 16. While there has been an
update to the util-linux package in noble, this did not land until
November 23.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2045586/+subscriptions
More information about the foundations-bugs
mailing list