[Bug 1717224] Comment bridged from LTC Bugzilla
bugproxy
bugproxy at us.ibm.com
Tue Oct 17 22:40:25 UTC 2017
------- Comment From swgreenl at us.ibm.com 2017-10-17 18:34 EDT-------
Hi folks.
Good news! We got a test window on the Ubuntu KVM host today.
We provisioned 24 new virtual Ubuntu guests for this test. Each virtual
domain uses a single qcow2 virtual boot volume. All guests are configured
identically, except that zs93kag100080, zs93kag100081, and zs93kag100082
are on a macvtap interface.
Here's a sample of one (running) guest's XML:
ubuntu at zm93k8:/home/scottg$ virsh dumpxml zs93kag100080
<domain type='kvm' id='65'>
  <name>zs93kag100080</name>
  <uuid>6bd4ebad-414b-4e1e-9995-7d061331ec01</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-xenial'>hvm</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>preserve</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag100080.qcow2'/>
      <backingStore type='file' index='1'>
        <format type='raw'/>
        <source file='/rawimages/ubu1604qcow2/ubuntu.1604-1.20161206.v1.raw.backing'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag100080.prm'/>
      <backingStore/>
      <target dev='vdc' bus='virtio'/>
      <alias name='virtio-disk2'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0006'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <readonly/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='none'>
      <alias name='usb'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0002'/>
    </controller>
    <interface type='bridge'>
      <mac address='02:00:00:00:40:80'/>
      <source bridge='ovsbridge1'/>
      <vlan>
        <tag id='1297'/>
      </vlan>
      <virtualport type='openvswitch'>
        <parameters interfaceid='cd58c548-0b1f-47e7-9ed5-ad4a1bc8b8e0'/>
      </virtualport>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </interface>
    <console type='pty' tty='/dev/pts/3'>
      <source path='/dev/pts/3'/>
      <target type='sclp' port='0'/>
      <alias name='console0'/>
    </console>
    <memballoon model='none'>
      <alias name='balloon0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-6bd4ebad-414b-4e1e-9995-7d061331ec01</label>
    <imagelabel>libvirt-6bd4ebad-414b-4e1e-9995-7d061331ec01</imagelabel>
  </seclabel>
</domain>
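Note the io='native' attribute on the file-backed disks above: each disk
opened with aio=native gets a kernel AIO context via io_setup(2), and the
events reserved by those contexts are what every 'virsh start' charges
against fs.aio-nr. A quick way to count the affected disks in a guest (a
generic one-liner, not part of the original report):
virsh dumpxml zs93kag100080 | grep -c "io='native'"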
To set up the test, we shut down all virtual domains and then ran a
script that simply starts the guests one at a time, capturing fs.aio-nr
before and after each 'virsh start'.
After attempting to start all guests in the list, the script goes into a
loop, checking fs.aio-nr once every minute for 10 minutes to see whether
that value changes (it does not).
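The start script itself was not attached; here is a minimal sketch of its
shape, reconstructed from the output below (the guest.list file and the
error handling are assumptions, not the original code):
#!/bin/bash
# Sketch of start_macvtaps_debug.sh: start each guest in turn, capturing
# fs.aio-nr around every 'virsh start', then watch fs.aio-nr once a
# minute for 10 minutes.
echo "Test started at $(date)"
echo "cat /proc/sys/fs/aio-max-nr"
cat /proc/sys/fs/aio-max-nr
sysctl fs.aio-nr
count=0
while read -r guest; do
    count=$((count + 1))
    echo "Starting $guest ; Count = $count"
    if out=$(virsh start "$guest" 2>&1); then
        echo "$guest started successfully ..."
    else
        echo "Error starting guest $guest ."
        echo "$out"
    fi
    sysctl fs.aio-nr
done < guest.list   # guest.list: one domain name per line (assumed)
echo "Monitor fs.aio-nr for 10 minutes, capture value every 60 seconds..."
for i in $(seq 1 10); do
    echo "Sleeping 60 seconds. Loop count = $i"
    sleep 60
    sysctl fs.aio-nr
done
echo "Test completed successfully."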
ubuntu at zm93k8:/home/scottg$ ./start_macvtaps_debug.sh
Test started at Tue Oct 17 17:48:29 EDT 2017
cat /proc/sys/fs/aio-max-nr
65535
fs.aio-nr = 0
Starting zs93kag100080 ; Count = 1
zs93kag100080 started successfully ...
fs.aio-nr = 6144
Starting zs93kag100081 ; Count = 2
zs93kag100081 started successfully ...
fs.aio-nr = 12288
Starting zs93kag100082 ; Count = 3
zs93kag100082 started successfully ...
fs.aio-nr = 18432
Starting zs93kag100083 ; Count = 4
zs93kag100083 started successfully ...
fs.aio-nr = 24576
Starting zs93kag100084 ; Count = 5
zs93kag100084 started successfully ...
fs.aio-nr = 30720
Starting zs93kag100085 ; Count = 6
zs93kag100085 started successfully ...
fs.aio-nr = 36864
Starting zs93kag70024 ; Count = 7
zs93kag70024 started successfully ...
fs.aio-nr = 43008
Starting zs93kag70025 ; Count = 8
zs93kag70025 started successfully ...
fs.aio-nr = 49152
Starting zs93kag70026 ; Count = 9
zs93kag70026 started successfully ...
fs.aio-nr = 55296
Starting zs93kag70027 ; Count = 10
zs93kag70027 started successfully ...
fs.aio-nr = 61440
Starting zs93kag70038 ; Count = 11
zs93kag70038 started successfully ...
fs.aio-nr = 67584
Starting zs93kag70039 ; Count = 12
zs93kag70039 started successfully ...
fs.aio-nr = 73728
Starting zs93kag70040 ; Count = 13
zs93kag70040 started successfully ...
fs.aio-nr = 79872
Starting zs93kag70043 ; Count = 14
zs93kag70043 started successfully ...
fs.aio-nr = 86016
Starting zs93kag70045 ; Count = 15
zs93kag70045 started successfully ...
fs.aio-nr = 92160
Starting zs93kag70046 ; Count = 16
zs93kag70046 started successfully ...
fs.aio-nr = 98304
Starting zs93kag70047 ; Count = 17
zs93kag70047 started successfully ...
fs.aio-nr = 104448
Starting zs93kag70048 ; Count = 18
zs93kag70048 started successfully ...
fs.aio-nr = 110592
Starting zs93kag70049 ; Count = 19
zs93kag70049 started successfully ...
fs.aio-nr = 116736
Starting zs93kag70050 ; Count = 20
zs93kag70050 started successfully ...
fs.aio-nr = 122880
Starting zs93kag70051 ; Count = 21
zs93kag70051 started successfully ...
fs.aio-nr = 129024
Starting zs93kag70052 ; Count = 22
Error starting guest zs93kag70052 .
error: Failed to start domain zs93kag70052
error: internal error: process exited while connecting to monitor: 2017-10-17T21:49:06.684444Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70052.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024
Starting zs93kag70053 ; Count = 23
Error starting guest zs93kag70053 .
error: Failed to start domain zs93kag70053
error: internal error: process exited while connecting to monitor: 2017-10-17T21:49:07.933457Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70053.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024
Starting zs93kag70054 ; Count = 24
Error starting guest zs93kag70054 .
error: Failed to start domain zs93kag70054
error: internal error: process exited while connecting to monitor: 2017-10-17T21:49:09.084863Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70054.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024
Monitor fs.aio-nr for 10 minutes, capture value every 60 seconds...
Sleeping 60 seconds. Loop count = 1
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 2
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 3
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 4
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 5
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 6
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 7
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 8
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 9
fs.aio-nr = 129024
Sleeping 60 seconds. Loop count = 10
fs.aio-nr = 129024
Test completed successfully.
## I couldn't understand why the error messages on startup were different
## this time; however, it seems to be the same underlying cause. That is,
## if I stop one domain, I am then able to successfully start a failed
## domain. For example,
ubuntu at zm93k8:/home/scottg$ virsh start zs93kag70052
Domain zs93kag70052 started
ubuntu at zm93k8:/home/scottg$ virsh list |grep zs93kag70052
 89    zs93kag70052                   running
ubuntu at zm93k8:/home/scottg$
## And now, if I try to start zs93kag70051 (which started fine the first
## time), it fails (with yet a different error):
ubuntu at zm93k8:/home/scottg$ virsh start zs93kag70051
error: Disconnected from qemu:///system due to I/O error
error: Failed to start domain zs93kag70051
error: End of file while reading data: Input/output error
error: One or more references were leaked after disconnect from the hypervisor
ubuntu at zm93k8:/home/scottg$
ubuntu at zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:16:18 EDT 2017
fs.aio-nr = 129024
## This time, I will kill one of the ovs-osa networked guests, and see
## if that then allows me to start zs93kag70051 ... (it does)
ubuntu at zm93k8:/home/scottg$ date;virsh destroy zs93kag100080
Tue Oct 17 18:18:29 EDT 2017
Domain zs93kag100080 destroyed
ubuntu at zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:19:18 EDT 2017
fs.aio-nr = 122880
ubuntu at zm93k8:/home/scottg$ date;virsh start zs93kag70051
Tue Oct 17 18:18:41 EDT 2017
Domain zs93kag70051 started
ubuntu at zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:18:52 EDT 2017
fs.aio-nr = 129024
## It appears that fs.aio-nr = 129024 is "The Brick Wall".
## Now, let's try increasing fs.aio-max-nr to 4194304 and see if that
## allows me to start more guests (it does).
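## (The new value had been added to /etc/sysctl.conf beforehand, as
## described in the bug description below. A one-off, non-persistent
## equivalent would be: sudo sysctl -w fs.aio-max-nr=4194304)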
ubuntu at zm93k8:/home/scottg$ sudo sysctl -p /etc/sysctl.conf
fs.aio-max-nr = 4194304
ubuntu at zm93k8:/home/scottg$ cat /proc/sys/fs/aio-max-nr
4194304
ubuntu at zm93k8:/home/scottg$ date;virsh start zs93kag70051
Tue Oct 17 18:27:54 EDT 2017
Domain zs93kag70051 started
ubuntu at zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:28:12 EDT 2017
fs.aio-nr = 129024
ubuntu at zm93k8:/home/scottg$ date;virsh start zs93kag70053
Tue Oct 17 18:29:38 EDT 2017
Domain zs93kag70053 started
ubuntu at zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:29:42 EDT 2017
fs.aio-nr = 135168
ubuntu at zm93k8:/home/scottg$ date;virsh start zs93kag70054
Tue Oct 17 18:29:55 EDT 2017
Domain zs93kag70054 started
ubuntu at zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:29:58 EDT 2017
fs.aio-nr = 141312
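## Rough capacity arithmetic (an estimate, not from the test output):
## each of these guests pins 6144 aio events (129024 / 21 = 6144), so the
## new ceiling should accommodate roughly 4194304 / 6144 = 682 such guests:
##   echo $(( 4194304 / 6144 ))   # -> 682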
I saved dmesg output in case you need that.
ubuntu at zm93k8:/home/scottg$ dmesg > dmesg.out.Oct17_bug157241
I will also keep this test environment up for a couple days in case you
need additional data.
Thank you.
https://bugs.launchpad.net/bugs/1717224
Title:
virsh start of virtual guest domain fails with internal error due to
low default aio-max-nr sysctl value
Status in Ubuntu on IBM z Systems:
In Progress
Status in kvm package in Ubuntu:
Confirmed
Status in linux package in Ubuntu:
In Progress
Status in procps package in Ubuntu:
New
Status in kvm source package in Xenial:
New
Status in linux source package in Xenial:
In Progress
Status in procps source package in Xenial:
New
Status in kvm source package in Zesty:
New
Status in linux source package in Zesty:
In Progress
Status in procps source package in Zesty:
New
Status in kvm source package in Artful:
Confirmed
Status in linux source package in Artful:
In Progress
Status in procps source package in Artful:
New
Bug description:
Starting virtual guests on Ubuntu 16.04.2 LTS installed with its
KVM hypervisor on an IBM z14 system LPAR fails on the 18th guest with
the following error:
root at zm93k8:/rawimages/ubu1604qcow2# virsh start zs93kag70038
error: Failed to start domain zs93kag70038
error: internal error: process exited while connecting to monitor: 2017-07-26T01:48:26.352534Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70038.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not open backing file: Could not set AIO state: Inappropriate ioctl for device
The previous 17 guests started fine:
root at zm93k8# virsh start zs93kag70020
Domain zs93kag70020 started
root at zm93k8# virsh start zs93kag70021
Domain zs93kag70021 started
.
.
root at zm93k8:/rawimages/ubu1604qcow2# virsh start zs93kag70036
Domain zs93kag70036 started
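A quick first check when this error appears (a generic command, not from
the original report) is to compare current AIO usage against the ceiling:
sysctl fs.aio-nr fs.aio-max-nr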
We ended up fixing the issue by adding the following line to /etc/sysctl.conf:
fs.aio-max-nr = 4194304
... then, reload the sysctl config file:
root at zm93k8:/etc# sysctl -p /etc/sysctl.conf
fs.aio-max-nr = 4194304
Now, we're able to start more guests...
root at zm93k8:/etc# virsh start zs93kag70036
Domain zs93kag70036 started
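On Ubuntu, the same persistent setting can also be placed in a sysctl
drop-in file (the file name below is only a suggestion):
echo 'fs.aio-max-nr = 4194304' | sudo tee /etc/sysctl.d/60-aio-max-nr.conf
sudo sysctl --system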
The default value was originally 65536:
root at zm93k8:/rawimages/ubu1604qcow2# cat /proc/sys/fs/aio-max-nr
65536
Note: we chose the 4194304 value because this is what our KVM on System z hypervisor ships as its default value, e.g. on our zKVM system:
[root at zs93ka ~]# cat /proc/sys/fs/aio-max-nr
4194304
ubuntu at zm93k8:/etc$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial
ubuntu at zm93k8:/etc$
ubuntu at zm93k8:/etc$ dpkg -s qemu-kvm |grep Version
Version: 1:2.5+dfsg-5ubuntu10.8
Is there documentation warning Ubuntu KVM users about the low default
value, with guidance on how to select an appropriate one? Also, would you
consider increasing the default aio-max-nr value to something much higher,
to accommodate significantly more virtual guests?
Thanks!
---uname output---
ubuntu at zm93k8:/etc$ uname -a
Linux zm93k8 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:12:54 UTC 2017 s390x s390x s390x GNU/Linux
Machine Type = z14
---Debugger---
A debugger is not configured
---Steps to Reproduce---
See Problem Description.
The problem was happening a week ago, so this data may not reflect that
activity. The file was collected on Aug 7, one week after we were hitting
the problem. /var/log/messages doesn't exist on this system, so I provided
syslog output instead.
All of the data was collected well after the problem was observed; if you
need me to reproduce the problem and capture fresh data, please let me
know. That's not a problem.
Also, we would have to make special arrangements for login access to
these systems. I'm happy to run traces and data collection for you as
needed; if that's not sufficient, we'll explore login access for you.
Thanks... - Scott G.
I was able to successfully recreate the problem and captured and attached new debug docs.
Recreate procedure:
# Started out with no virtual guests running.
ubuntu at zm93k8:/home/scottg$ virsh list
 Id    Name                           State
----------------------------------------------------
# Set fs.aio-max-nr back to original Ubuntu "out of the box" value in /etc/sysctl.conf
ubuntu at zm93k8:~$ tail -1 /etc/sysctl.conf
fs.aio-max-nr = 65536
## sysctl -a shows:
fs.aio-max-nr = 4194304
## Reload sysctl.
ubuntu at zm93k8:~$ sudo sysctl -p /etc/sysctl.conf
fs.aio-max-nr = 65536
ubuntu at zm93k8:~$
ubuntu at zm93k8:~$ sudo sysctl -a |grep fs.aio-max-nr
fs.aio-max-nr = 65536
ubuntu at zm93k8:~$ cat /proc/sys/fs/aio-max-nr
65536
# Attempt to start more than 17 qcow2 virtual guests on the Ubuntu
# host. Fails on the 18th guest.
Script used to start guests:
ubuntu at zm93k8:/home/scottg$ date;./start_privs.sh
Wed Aug 23 13:21:25 EDT 2017
virsh start zs93kag70015
Domain zs93kag70015 started
Started zs93kag70015 successfully ...
virsh start zs93kag70020
Domain zs93kag70020 started
Started zs93kag70020 successfully ...
virsh start zs93kag70021
Domain zs93kag70021 started
Started zs93kag70021 successfully ...
virsh start zs93kag70022
Domain zs93kag70022 started
Started zs93kag70022 successfully ...
virsh start zs93kag70023
Domain zs93kag70023 started
Started zs93kag70023 successfully ...
virsh start zs93kag70024
Domain zs93kag70024 started
Started zs93kag70024 successfully ...
virsh start zs93kag70025
Domain zs93kag70025 started
Started zs93kag70025 successfully ...
virsh start zs93kag70026
Domain zs93kag70026 started
Started zs93kag70026 successfully ...
virsh start zs93kag70027
Domain zs93kag70027 started
Started zs93kag70027 successfully ...
virsh start zs93kag70028
Domain zs93kag70028 started
Started zs93kag70028 successfully ...
virsh start zs93kag70029
Domain zs93kag70029 started
Started zs93kag70029 successfully ...
virsh start zs93kag70030
Domain zs93kag70030 started
Started zs93kag70030 successfully ...
virsh start zs93kag70031
Domain zs93kag70031 started
Started zs93kag70031 successfully ...
virsh start zs93kag70032
Domain zs93kag70032 started
Started zs93kag70032 successfully ...
virsh start zs93kag70033
Domain zs93kag70033 started
Started zs93kag70033 successfully ...
virsh start zs93kag70034
Domain zs93kag70034 started
Started zs93kag70034 successfully ...
virsh start zs93kag70035
Domain zs93kag70035 started
Started zs93kag70035 successfully ...
virsh start zs93kag70036
error: Failed to start domain zs93kag70036
error: internal error: process exited while connecting to monitor: 2017-08-23T17:21:47.131809Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70036.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not open backing file: Could not set AIO state: Inappropriate ioctl for device
Exiting script ... start zs93kag70036 failed
ubuntu at zm93k8:/home/scottg$
# Show that there are only 17 running guests.
ubuntu at zm93k8:/home/scottg$ virsh list |grep run |wc -l
17
ubuntu at zm93k8:/home/scottg$ virsh list
 Id    Name                           State
----------------------------------------------------
 25    zs93kag70015                   running
 26    zs93kag70020                   running
 27    zs93kag70021                   running
 28    zs93kag70022                   running
 29    zs93kag70023                   running
 30    zs93kag70024                   running
 31    zs93kag70025                   running
 32    zs93kag70026                   running
 33    zs93kag70027                   running
 34    zs93kag70028                   running
 35    zs93kag70029                   running
 36    zs93kag70030                   running
 37    zs93kag70031                   running
 38    zs93kag70032                   running
 39    zs93kag70033                   running
 40    zs93kag70034                   running
 41    zs93kag70035                   running
# For fun, try starting zs93kag70036 again manually.
ubuntu at zm93k8:/home/scottg$ date;virsh start zs93kag70036
Wed Aug 23 13:27:28 EDT 2017
error: Failed to start domain zs93kag70036
error: internal error: process exited while connecting to monitor: 2017-08-23T17:27:30.031782Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70036.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not open backing file: Could not set AIO state: Inappropriate ioctl for device
# Show the XML (they're all basically the same)...
ubuntu at zm93k8:/home/scottg$ cat zs93kag70036.xml
<domain type='kvm'>
  <name>zs93kag70036</name>
  <memory unit='MiB'>4096</memory>
  <currentMemory unit='MiB'>2048</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio'>hvm</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>preserve</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag70036.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
      <boot order='1'/>
    </disk>
    <interface type='network'>
      <source network='privnet1'/>
      <model type='virtio'/>
      <mac address='52:54:00:70:d0:36'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </interface>
    <!--
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/dm-uuid-mpath-36005076802810e5540000000000006e4'/>
      <target dev='vde' bus='virtio'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0005'/>
      <readonly/>
    </disk>
    -->
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag70036.prm'/>
      <target dev='vdf' bus='virtio'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0006'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/guestimages/data1/zs93kag70036.iso'/>
      <target dev='sda' bus='scsi'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='none'/>
    <memballoon model='none'/>
    <console type='pty'>
      <target type='sclp' port='0'/>
    </console>
  </devices>
</domain>
This condition is very easy to replicate. However, we may be losing this system in the next day or two, so please let me know ASAP if you need any more data. Thank you...
- Scott G.
== Comment: #11 - Viktor Mihajlovski <MIHAJLOV at de.ibm.com> - 2017-09-14
In order to support many KVM guests, it is advisable to raise aio-max-nr as suggested in the problem description; see also http://kvmonz.blogspot.co.uk/p/blog-page_7.html. I would also suggest that the system default setting be increased.
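A quick way to watch AIO consumption against the ceiling while guests are
starting (a generic command, not from the bug report):
watch -n 5 'sysctl fs.aio-nr fs.aio-max-nr'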