[Bug 1925211] Re: Hot-unplug of disks leaves broken block devices around in Hirsute on s390x
Christian Ehrhardt
1925211 at bugs.launchpad.net
Wed Apr 21 07:08:03 UTC 2021
I wondered whether I could trigger the same issue on an LPAR, since that
would raise the severity IMHO. I make no claim of completeness for these
tests with regard to everything that could happen; I tried what I
considered the low-hanging fruit for this cross-check.
Pre-conditions each time:
- a DASD attached to the system
- not in use, e.g. no filesystem on it
- no aliases enabled
=> this (more or less) matches our former KVM-based test case
$ lscss | grep 1523; lsdasd 0.0.1523; ll /dev/dasdc
0.0.1523 0.0.0183 3390/0c 3990/e9 yes f0 f0 ff 10111213 00000000
Bus-ID Status Name Device Type BlkSz Size Blocks
================================================================================
0.0.1523 active dasdc 94:8 ECKD 4096 7043MB 1803060
brw-rw---- 1 root disk 94, 8 Apr 21 06:21 /dev/dasdc
I tracked the same state after each removal action and ran udevadm monitor to see if an unbind happened.
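The post-removal state check can be scripted; a minimal helper (my own sketch, not part of the report) polls until a block node such as /dev/dasdc disappears:

```shell
#!/bin/sh
# Sketch (not from the report): poll until a device node disappears, so the
# post-removal state check can run unattended with a timeout.
wait_gone() {  # wait_gone <path> <timeout-seconds>; 0 = gone, 1 = still there
    path=$1
    deadline=$(( $(date +%s) + $2 ))
    while [ -e "$path" ]; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1    # node survived the timeout: removal did not propagate
        fi
        sleep 1
    done
    return 0
}
# e.g.: wait_gone /dev/dasdc 30 || echo "dasdc still present"
```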
---
#1 cio purge
$ sudo cio_ignore -a 0.0.1523; sudo cio_ignore --purge
=> can't take away online devices, and I'm not interested in blocking the
device initially ..
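For completeness, the offline-first variant I would expect to work here - a dry-run sketch under the assumption that chccwdev (s390-tools) is available on the LPAR; run() only echoes each command so the sequence can be reviewed off-host:

```shell
#!/bin/sh
# Sketch, not from the report: cio_ignore refuses online devices, so take the
# CCW device offline first (chccwdev from s390-tools is an assumption here).
DEV=0.0.1523
run() {  # echo instead of execute while DRY_RUN=1, reviewable on any machine
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}
run chccwdev -d "$DEV"    # set the CCW device offline
run cio_ignore -a "$DEV"  # add it to the ignore list
run cio_ignore --purge    # purge the now-ignored subchannels
```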
---
#2 chzdev
$ sudo chzdev --disable 0.0.1523
=> properly removed
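A scripted version of the "properly removed" check (my sketch; check_gone takes the listing command as parameters, so the loop itself can be exercised without an s390x system):

```shell
#!/bin/sh
# Sketch (not from the report): verify a device left a given listing after
# chzdev --disable; the listing command is a parameter so this runs anywhere.
DEV=0.0.1523
check_gone() {  # check_gone <label> <cmd...>; 0 = no longer listed
    label=$1; shift
    if "$@" 2>/dev/null | grep -q "$DEV"; then
        echo "still listed in $label"
        return 1
    fi
    return 0
}
# Intended use on the LPAR:
#   check_gone lscss lscss && check_gone lsdasd lsdasd "$DEV" && echo removed
```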
---
#3 remove the dasds on the storage server
"LSS 08 SRV_SS0_0823" is mapped to s1lp5 0.0.1523 - removing that on the storage server
By default that fails:
Error - delete of volume SRV_SS0_0823 failed.
Error: CMUN02948E IBM.2107-75DXP71/0823 The Delete logical volume task cannot be initiated because the Allow Host Pre-check Control Switch is set to true and the volume that you have specified is online to a host.
In the old UI the force option is available as a checkbox - trying via that.
Done.
The system does not realize that the disk is gone; I/O on it (e.g. dasdfmt) deadlocks.
After a while in that hang the system realizes it is in trouble:
dmesg:
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd): I/O status report for device 0.0.1523:
dasd(eckd): in req: 00000000e903a5ac CC:00 FC:00 AC:00 SC:00 DS:00 CS:00 RC:-11
dasd(eckd): device 0.0.1523: Failing CCW: 0000000000000000
dasd(eckd): SORRY - NO VALID SENSE AVAILABLE
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd): Related CP in req: 00000000e903a5ac
dasd(eckd): CCW 00000000c3e100c4: 2760000C 014C5FF0 DAT: 18000000 08231c00 00000000
dasd(eckd): CCW 00000000335dd238: 3E20401A 00A40000 DAT: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd):......
Apr 21 06:42:32 s1lp5 kernel: dasd-eckd.adb621: 0.0.1523: ERP failed for the DASD
udevadm:
KERNEL[1313.022835] remove /devices/css0/0.0.0183/0.0.1523/block/dasdc/dasdc1 (block)
UDEV [1313.024648] remove /devices/css0/0.0.0183/0.0.1523/block/dasdc/dasdc1 (block)
Even after the above - the disk is still "present":
$ lscss | grep 1523; lsdasd 0.0.1523; ll /dev/dasdc
0.0.1523 0.0.0183 3390/0c 3990/e9 yes f0 f0 0f 10111213 00000000
Bus-ID Status Name Device Type BlkSz Size Blocks
================================================================================
0.0.1523 active dasdc 94:8 ECKD 4096 7043MB 1803060
brw-rw---- 1 root disk 94, 8 Apr 21 06:26 /dev/dasdc
Only when I detach it from the system via chzdev do the hanging processes get unstuck and the device get removed.
So maybe case #3 is a good one ... ?
Trying the same with a Focal kernel that didn't have the issue we've seen in the KVM disk-detach case:
=> 5.4.0-72-generic
Its behavior here is rather similar to the new 5.11 kernel.
Thereby, while the testing is not complete, we can still assume that this
issue really might only affect the detach of KVM disks. Is that good? No.
But is it so bad that we need to interrupt the kernel cycle? IMHO it is
not.
So IMHO this can go into the next normal kernel SRU cycle, which also
gives the IBM developers a chance to chime in on which of the proposed
solutions they want.
** Changed in: udev (Ubuntu Hirsute)
Status: New => Invalid
** Changed in: systemd (Ubuntu Hirsute)
Status: New => Invalid
** Changed in: linux (Ubuntu Hirsute)
Status: Confirmed => Triaged
** Changed in: linux (Ubuntu Hirsute)
Importance: Undecided => High
** Changed in: udev (Ubuntu Hirsute)
Importance: Critical => Undecided
** Changed in: ubuntu-z-systems
Status: New => Triaged
** Changed in: ubuntu-z-systems
Importance: Undecided => High
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/1925211
Title:
Hot-unplug of disks leaves broken block devices around in Hirsute on
s390x
Status in Ubuntu on IBM z Systems:
Triaged
Status in linux package in Ubuntu:
Triaged
Status in systemd package in Ubuntu:
Invalid
Status in udev package in Ubuntu:
Invalid
Status in linux source package in Hirsute:
Triaged
Status in systemd source package in Hirsute:
Invalid
Status in udev source package in Hirsute:
Invalid
Bug description:
Repro:
#1 Get a guest
$ uvt-kvm create --disk 5 --password=ubuntu h release=hirsute arch=s390x label=daily
$ uvt-kvm wait h release=hirsute arch=s390x label=daily
#2 Attach and Detach disk
$ sudo qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 10M
$ virsh attach-disk h /var/lib/libvirt/images/test.qcow2 vdc
$ virsh detach-disk h vdc
From libvirt's POV it is gone at this point:
$ virsh domblklist h
Target Source
------------------------------------------------------------------
vda /var/lib/uvtool/libvirt/images/hirsute-2nd-zfs.qcow
vdb /var/lib/uvtool/libvirt/images/hirsute-2nd-zfs-ds.qcow
But the guest still thinks it is present:
$ uvt-kvm ssh --insecure hirsute-2nd-zfs lsblk
...
vdc 252:32 0 20M 0 disk
This remains even a while later (so it is not a race).
Any access to it in the guest will hang (as you'd expect of a non-existent blockdev):
4 0 1758 1739 20 0 12140 4800 - S+ pts/0 0:00 | \_ sudo mkfs.ext4 /dev/vdc
4 0 1759 1758 20 0 6924 1044 - D+ pts/0 0:00 | \_ mkfs.ext4 /dev/vdc
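This check can be automated; a hedged sketch (helper name and loop are mine, not from the report) that, after `virsh detach-disk h vdc`, polls a command's output until the detached target stops showing up:

```shell
#!/bin/sh
# Sketch (not from the report): poll the guest's lsblk until vdc disappears;
# the command is parameterized so the loop itself can be tested without a VM.
absent_in_output() {  # absent_in_output <pattern> <tries> <cmd...>
    pat=$1; tries=$2; shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" | grep -q "$pat" || return 0  # pattern gone: detach completed
        i=$((i + 1))
        sleep 1
    done
    return 1                               # still listed: the bug reproduced
}
# Intended use on the host (guest name "h" and target "vdc" as in the repro):
#   absent_in_output vdc 10 uvt-kvm ssh --insecure h lsblk || echo "vdc stale"
```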
The result above was originally found with hirsute-guest @ hirsute-host
on s390x.
I do NOT see the same with groovy-guest @ hirsute-host on s390x.
I DO see the same with hirsute-guest @ groovy-host on s390x.
=> Guest version dependent, not host/hypervisor dependent
I DO see the same with ZFS disks AND LVM disks being added & removed
=> not type dependent
I do NOT see the same on x86.
=> Arch dependent ??
... the evidence slowly points towards an issue in the guest; damn, we are so
close to release - but disks that don't fully detach are critical in my POV :-/
I'm filing this as-is for awareness, but it will certainly need more debugging.
Unsure where this will eventually end up, I'll file it against kernel/udev/systemd for now.
If there are any known issues/components that are related let me know please!
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu65
Architecture: s390x
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
CRDA: N/A
CasperMD5CheckResult: unknown
DistroRelease: Ubuntu 21.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci:
Lspci-vt: -[0000:00]-
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t: Error: command ['lsusb', '-t'] failed with exit code 1: /sys/bus/usb/devices: No such file or directory
Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
Package: udev
PackageArchitecture: s390x
PciMultimedia:
ProcFB:
ProcKernelCmdLine: root=LABEL=cloudimg-rootfs
ProcVersionSignature: User Name 5.11.0-14.15-generic 5.11.12
RelatedPackageVersions:
linux-restricted-modules-5.11.0-14-generic N/A
linux-backports-modules-5.11.0-14-generic N/A
linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: hirsute uec-images
Uname: Linux 5.11.0-14-generic s390x
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy lxd netdev plugdev sudo video
_MarkForUpload: True
acpidump:
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1925211/+subscriptions