[Bug 1503286] Re: ISST-LTE: Boot of Ubuntu15.10 lpar fails: "mounting /dev/sdn2 on /root failed: Device or resource busy" [multipath]
Mauricio Faria de Oliveira
mauricfo at linux.vnet.ibm.com
Fri Feb 5 16:22:07 UTC 2016
Verified on 14.04.
Marking verification-done.
As mentioned in the description test-case, the issue is hard to reproduce.
Trying to force it to happen, I modified the updated package local-premount/multipath
script to remove/rescan the SCSI devices in the background right before the
udevadm settle command, and could not reproduce the failure.
The system booted successfully.
More assurance is given since the patch is the same as that verified by
the tester on the original environment (comment #12).
. /scripts/functions
+multipath -F
+
+for sd_delete in /sys/block/sd*/device/delete; do
+ echo 1 > $sd_delete
+done
+
+for host_scan in /sys/class/scsi_host/host*/scan; do
+ echo '- - -' > $host_scan
+done &
+
if [ -x /sbin/multipathd ]
then
[ "$quiet" != "y" ] && log_begin_msg "Waiting for udev to settle (multipath)"
udevadm settle --timeout=121 || true
[ "$quiet" != "y" ] && log_end_msg
fi
exit 0
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to multipath-tools in Ubuntu.
https://bugs.launchpad.net/bugs/1503286
Title:
ISST-LTE: Boot of Ubuntu15.10 lpar fails: "mounting /dev/sdn2 on /root
failed: Device or resource busy" [multipath]
Status in multipath-tools package in Ubuntu:
Fix Released
Status in multipath-tools source package in Trusty:
Fix Committed
Status in multipath-tools source package in Vivid:
Fix Committed
Status in multipath-tools source package in Wily:
Fix Released
Bug description:
[Impact]
Systems with disks that have long spin-up times or otherwise take a while to be detected may be affected by a failure to boot due to the drives underlying multipath devices not being available.
[Test case]
This issue is difficult to reproduce.
- Boot a system with the boot device on multipath.
This may be limited to POWER LPARs. See description below.
[Regression Potential]
Given that a new initramfs script is introduced to add a udev trigger with a timeout of 2 minutes (121 seconds), users may notice a delay of up to two minutes in booting if devices take 2 minutes or more to be brought up or detected by udev.
---
== Comment: #0 - Manjunatha H R <manjuhr1 at in.ibm.com> - 2015-09-25 11:05:36 ==
Booting of Ubuntu15.10 lpar fails and control falls to initramfs.
uname -a
--------------
Linux (none) 4.2.0-10-generic #12-Ubuntu SMP Tue Sep 15 19:46:04 UTC 2015 ppc64le GNU/Linux
Boot log:
-------------
Booting a command list
Loading Linux 4.2.0-10-generic ...
Loading initial ramdisk ...
OF stdout device is: /vdevice/vty at 30000000
Preparing to boot Linux version 4.2.0-10-generic (buildd at fisher04) (gcc version 5.2.1 20150911 (Ubuntu 5.2.1-17ubuntu4) ) #12-Ubuntu SMP Tue Sep 15 19:46:04 UTC 2015 (Ubuntu 4.2.0-10.12-generic 4.2.0)
Detected machine type: 0000000000000101
Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
command line: BOOT_IMAGE=/boot/vmlinux-4.2.0-10-generic root=UUID=822dd709-5b69-45a9-aba5-63cb55768ffb ro splash quiet topology_updates=off
memory layout at init:
memory_limit : 0000000000000000 (16 MB aligned)
alloc_bottom : 000000000bf80000
alloc_top : 0000000010000000
alloc_top_hi : 0000000010000000
rmo_top : 0000000010000000
ram_top : 0000000010000000
found display : /pci at 80000002000002c/display at 0, opening... done
instantiating rtas at 0x000000000eb60000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x000000000bf90000 -> 0x000000000bf91965
Device tree struct 0x000000000bfa0000 -> 0x000000000bfe0000
Quiescing Open Firmware ...
Booting Linux via __start() ...
-> smp_release_cpus()
spinning_secondaries = 199
<- smp_release_cpus()
<- setup_system()
[ 2.868103] [drm:radeon_device_init [radeon]] *ERROR* Unable to find PCI I/O BAR
[ 3.074553] [drm:radeon_atombios_init [radeon]] *ERROR* Unable to find PCI I/O BAR; using MMIO for ATOM IIO
[ 5.060785] lpfc 0002:90:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
Scanning for Btrfs filesystems
fsck from util-linux 2.26.2
/dev/sdn2 is in use.
e2fsck: Cannot continue, aborting.
fsck exited with status code 8
[ 36.233086] rport-0:0-9: blocked FC remote port time out: removing rport
mount: mounting /dev/sdn2 on /root failed: Device or resource busy
Target filesystem doesn't have requested /sbin/init.
mount: mounting /dev on /root/dev failed: No such file or directory
No init found. Try passing init= bootarg.
BusyBox v1.22.1 (Ubuntu 1:1.22.0-15ubuntu1) built-in shell (ash)
Enter 'help' for a list of built-in commands.
(initramfs)
-------------------------
This lpar is having multipath disks and boot disk is on a multipath disk.
Boot passes only whenever fsck tries to scan boot disk via : /dev/dm OR /dev/mapper/mpath
Boot Pass scenarios:
----------------------------
1. Boot passed when fsck tried scanning "/dev/mapper/mpathb"
fsck from util-linux 2.26.2
/dev/mapper/mpathb-part2: clean, 81802/3139584 files, 1040598/12558080 blocks
2. Boot passed when fsck tried scanning "/dev/dm-3"
Scanning for Btrfs filesystems
fsck from util-linux 2.26.2
/dev/dm-3: clean, 81802/3139584 files, 1040605/12558080 blocks
Boot fails, whenever fsck is called on /dev/sd
Boot fail scenario: Boot failed when fsck is called on "/dev/sdn"
-------------------------
Scanning for Btrfs filesystems
fsck from util-linux 2.26.2
/dev/sdn2 is in use.
e2fsck: Cannot continue, aborting.
fsck exited with status code 8
[ 36.108653] rport-0:0-9: blocked FC remote port time out: removing rport
mount: mounting /dev/sdn2 on /root failed: Device or resource busy
Target filesystem doesn't have requested /sbin/init.
mount: mounting /dev on /root/dev failed: No such file or directory
No init found. Try passing init= bootarg.
BusyBox v1.22.1 (Ubuntu 1:1.22.0-15ubuntu1) built-in shell (ash)
Enter 'help' for a list of built-in commands.
(initramfs)
-------------------------
Contact info:
----------------
Manju (manjuhr1 at in.ibm.com) A.P. (apundt at us.ibm.com)
== Comment: #12 - Mauricio Faria De Oliveira <mauricfo at br.ibm.com> - 2015-10-02 13:51:29 ==
Hi Manju and Alton,
I could not reproduce this bug in 2 attempts.
The LPAR booted successfully, using the root=UUID= parameter.
By looking at this message from the description:
> mount: mounting /dev/sdn2 on /root failed: Device or resource busy
It should have happened because multipath udev rules failed to update
the /dev/disk/by-id/<uuid> symlink from /dev/sdn to /dev/dm-X, but
multipathing the path was successful (so it got locked/in-use).
If you can reproduce it again, please leave the LPAR in the failing state (in the initramfs), reopen this bug and ping me.
I'd be happy to debug it.
Thanks!
== Comment: #15 - Mauricio Faria De Oliveira <mauricfo at br.ibm.com> - 2015-10-05 19:59:56 ==
This is probably a race between the resolve_device() call in mountroot() and the multipath discovery triggered by udev rules.
If resolve_device() runs before the root device is multipathed, $ROOT is set to an individual path (eg, /dev/sdf2) rather than its multipah device (eg, /dev/mapper/mpathb-part2), because the /dev/disk/by-uuid/<UUID> symlink is not updated yet.
The multipath discovery finishes after $ROOT is set, so the individual path becomes locked, and afterwards the root mount will be attempted on it -- this fails.
The LPAR is now patched w/ a test fix that is supposed to ensure
resolve_device() only starts after udev rules are finished.
Can you try to recreate the issue, please? Thanks!
== Comment: #16 - Mauricio Faria De Oliveira <mauricfo at br.ibm.com> - 2015-10-05 20:05:37 ==
Console messages
(!) Note: the local-premount messages (Running ... & done.) occur
around SCSI device scan/discovery time.
...
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... Begin: Loading multipath modules ... [ 5.113397] device-mapper: multipath: version 1.9.0 loaded
Success: loaded module dm-multipath.
Failure: failed to load module dm-emc.
done.
Begin: Discovering multipaths ... done.
done.
Begin: Running /scripts/local-premount ... [ 5.187071] scsi 0:0:0:0: Direct-Access IBM 2107900 .850 PQ: 0 ANSI: 5
[ 5.195814] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 5.203289] sd 0:0:0:0: [sda] 62914560 512-byte logical blocks: (32.2 GB/30.0 GiB)
...
[ 5.878046] sd 0:0:3:3: [sdp] Attached SCSI disk
...
<10-20 SCSI disks via FC>
...
[ 5.923577] device-mapper: multipath round-robin: version 1.0.0 loaded
...
done.
...
If resolve_device() runs before the multipath udev rules
(the rules multipath the root device and update the /dev/disk/by-uuid symlink of $ROOT)
this happens:
Begin: Checking root file system ... fsck from util-linux 2.26.2
/dev/sdf2 is in use.
e2fsck: Cannot continue, aborting.
fsck exited with status code 8
done.
Warning: File system check failed but did not detect errors
mount: mounting /dev/sdf2 on /root failed: Device or resource busy
done.
Target filesystem doesn't have requested /sbin/init.
Begin: Running /scripts/local-bottom ... done.
Begin: Running /scripts/init-bottom ...
...
mount: mounting /dev on /root/dev failed: No such file or directory
done.
No init found. Try passing init= bootarg.
...
(initramfs)
So, /dev/sdf2 is in use ... and hits Device or resource busy.
This comes from $ROOT.
However, the root=UUID= symlink points to the multipath device:
(initramfs) echo $ROOT
/dev/sdf2
(initramfs) cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinux-4.2.0-12-generic root=UUID=44bd8a6e-8613-431a-9335-879d8cf5d0e4 ro
(initramfs) ls -l /dev/disk/by-uuid/44bd8a6e-8613-431a-9335-879d8cf5d0e4
lrwxrwxrwx 1 11 /dev/disk/by-uuid/44bd8a6e-8613-431a-9335-879d8cf5d0e4 -> ../../dm-19
(initramfs) ls -l /dev/sdf
brw------- 1 8, 80 /dev/sdf
(initramfs) dmsetup table | grep 8:80
mpathb: 0 104857600 multipath 0 0 1 1 round-robin 0 4 1 8:80 1 8:16 1 8:144 1 8:208 1
It's probably because resolve_device() was racing w/ the multipath discoveries from udev rules.
resolve_device() finished before the /dev/disk/by-uuid/ symlink was updated by multipath discovery.
Code:
initramfs :: /init
log_begin_msg "Mounting root file system"
...
mountroot
log_end_msg
(!) Note: message "Mounting root fs" and call to mountroot()
initramfs :: /scripts/local
mountroot()
{
local_mount_root
}
local_mount_root()
{
...
local_premount
ROOT=$(resolve_device "$ROOT")
...
}
(!) Note: local_premount() is the last call before resolve_device()
is called
local_premount()
{
...
[ "$quiet" != "y" ] && log_begin_msg "Running /scripts/local-premount"
run_scripts /scripts/local-premount
[ "$quiet" != "y" ] && log_end_msg
...
}
So we're testing a call to 'udevadm settle' in /scripts/local-
premount/multipath script.
== Comment: #17 - Mauricio Faria De Oliveira <mauricfo at br.ibm.com> -
2015-10-05 20:08:26 ==
== Comment: #18 - Manjunatha H R <manjuhr1 at in.ibm.com> - 2015-10-06 06:05:30 ==
Thank you Mauricio for a quick fix!!
Lpar is booting up properly without seeing device or resource busy errors.
== Comment: #20 - Mauricio Faria De Oliveira <mauricfo at br.ibm.com> - 2015-10-06 09:04:50 ==
Confirmed w/ Manju the # of tests.
10:00:41 AM: Manjunatha H R: Hi Mauricio, I tried around 10 boots..
10:01:07 AM: Manjunatha H R: all times it booted up..
Sounds good.
I'll be sending a patch/mirroring.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1503286/+subscriptions
More information about the foundations-bugs
mailing list