[Bug 1587039] Please test proposed package
James Page
james.page at ubuntu.com
Tue Jun 7 13:16:37 UTC 2016
Hello Amad, or anyone else affected,
Accepted qemu into kilo-proposed. The package will build now and be
available in the Ubuntu Cloud Archive in a few hours, and then in the
-proposed repository.
Please help us by testing this new package. To enable the -proposed
repository:
sudo add-apt-repository cloud-archive:kilo-proposed
sudo apt-get update
Your feedback will aid us in getting this update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, and change the tag
from verification-kilo-needed to verification-kilo-done. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-kilo-failed. In either case, details of your testing
will help us make a better decision.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance!
** Changed in: cloud-archive/kilo
Status: New => Fix Committed
** Tags added: verification-kilo-needed
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1587039
Title:
aio: strengthen memory barriers for bottom half scheduling
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive kilo series:
Fix Committed
Bug description:
[Impact]
There are two problems with memory barriers in async.c. The fix is
to use atomic_xchg in order to achieve sequential consistency between
the scheduling of a bottom half and the corresponding execution.
First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
not execute a memory barrier to order any writes needed by the callback
before the read of bh->scheduled. If the other side sees req->state as
THREAD_ACTIVE, the callback is not invoked and you get deadlock.
Second, the memory barrier in aio_bh_poll is too weak. Without this
patch, it is possible that bh->scheduled = 0 is not "published" until
after the callback has returned. Another thread wants to schedule the
bottom half, but it sees bh->scheduled = 1 and does nothing. This causes
a lost wakeup. The memory barrier should have been changed to smp_mb()
in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
2014-06-03) together with qemu_bh_schedule()'s.
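  As a rough illustration of the ordering the fix provides, here is a
  minimal, self-contained sketch using C11 atomics in place of QEMU's own
  smp_mb()/atomic_xchg macros; the SketchBH type and all function names
  are invented for this example and are not the upstream async.c code:
  /* Minimal sketch, not the upstream async.c code: C11 atomics stand in
   * for QEMU's atomic_xchg()/smp_mb() macros and all names are invented. */
  #include <stdatomic.h>
  #include <stdio.h>
  typedef struct {
      atomic_int scheduled;       /* 0 = idle, 1 = scheduled */
      void (*cb)(void *opaque);   /* bottom-half callback */
      void *opaque;
  } SketchBH;
  /* Problematic shape: the early return when the flag is already set skips
   * the barrier entirely, so writes the caller made before scheduling may
   * not be visible to the thread that runs the callback. */
  static void sketch_bh_schedule_weak(SketchBH *bh)
  {
      if (atomic_load_explicit(&bh->scheduled, memory_order_relaxed)) {
          return;                          /* no barrier on this path */
      }
      atomic_store(&bh->scheduled, 1);     /* barrier only on the slow path */
      puts("notify event loop");
  }
  /* Fixed shape: a sequentially consistent exchange is a full barrier on
   * every path, and only the thread that flips 0 -> 1 kicks the event loop. */
  static void sketch_bh_schedule(SketchBH *bh)
  {
      if (atomic_exchange(&bh->scheduled, 1) == 0) {
          puts("notify event loop");
      }
  }
  /* Polling side: clearing the flag with an exchange publishes the 0 before
   * the callback runs, so a thread that re-schedules during the callback
   * cannot see a stale 1 and silently drop the wakeup. */
  static void sketch_bh_poll(SketchBH *bh)
  {
      if (atomic_exchange(&bh->scheduled, 0) == 1) {
          bh->cb(bh->opaque);
      }
  }
  static void demo_cb(void *opaque)
  {
      printf("callback ran: %s\n", (const char *)opaque);
  }
  int main(void)
  {
      SketchBH bh = { .cb = demo_cb, .opaque = "demo" };
      atomic_init(&bh.scheduled, 0);
      sketch_bh_schedule(&bh);   /* flips 0 -> 1 and notifies */
      sketch_bh_schedule(&bh);   /* already scheduled, still a full barrier */
      sketch_bh_poll(&bh);       /* clears the flag, then runs the callback */
      (void)sketch_bh_schedule_weak;  /* kept only for contrast */
      return 0;
  }
  The point of the exchange is that it is a full barrier on every path:
  the scheduling side publishes its writes even when the bottom half was
  already scheduled, and the polling side publishes the cleared flag
  before the callback runs, so a concurrent re-schedule cannot be lost.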
Both of these involve a store and a load, so they are reproducible on
x86_64 as well. It is however much easier on aarch64, where the
libguestfs test suite triggers the bug fairly easily. Even there the
failure can go away or appear depending on compiler optimization level,
tracing options, or even kernel debugging options.
[Test Case]
  Paul Leveille, however, reported how to trigger the problem within 15
  minutes on x86_64 as well. His (untested) recipe, reproduced here
  for reference, is the following:
1) Qcow2 (or 3) is critical – raw files alone seem to avoid the
problem.
  2) Use “cache=directsync” rather than the default of
  “cache=none” to make the problem appear more easily.
3) Use a server with a write-back RAID controller to allow for rapid
IO rates.
  4) Run a random-access load that (mostly) writes chunks to various
  files on the virtual block device (a minimal stand-in for such a
  load is sketched after this list).
  a. I use ‘diskload.exe c:25’, a Microsoft HCT load
  generator, on Windows VMs.
  b. Iometer can probably be configured to generate a similar
  load.
5) Run multiple VMs in parallel, against the same storage device,
to shake the failure out sooner.
6) IvyBridge and Haswell processors for certain; not sure about
others.
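  For reference, a hypothetical stand-in for the kind of load step 4
  describes could look like the sketch below; it is not the tool used in
  the original report, and the file name, region size, and chunk size
  are arbitrary choices. O_DSYNC is used so that every write reaches the
  device, in the same spirit as running the guest disk with
  cache=directsync:
  /* Hypothetical load generator, not the tool from the report above.
   * Repeatedly writes 64 KiB chunks at random offsets inside a 1 GiB
   * region of a file on the device under test. */
  #define _XOPEN_SOURCE 700
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #define REGION_SIZE (1024L * 1024L * 1024L)   /* 1 GiB test region */
  #define CHUNK_SIZE  (64 * 1024)               /* 64 KiB per write */
  int main(int argc, char **argv)
  {
      const char *path = argc > 1 ? argv[1] : "/mnt/load.bin";
      static char chunk[CHUNK_SIZE];
      memset(chunk, 0xA5, sizeof(chunk));
      /* O_DSYNC makes each write synchronous with the device. */
      int fd = open(path, O_RDWR | O_CREAT | O_DSYNC, 0644);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      for (;;) {
          /* pick a random, chunk-aligned offset inside the region */
          off_t off = (off_t)(random() % (REGION_SIZE / CHUNK_SIZE)) * CHUNK_SIZE;
          if (pwrite(fd, chunk, sizeof(chunk), off) < 0) {
              perror("pwrite");
              break;
          }
      }
      close(fd);
      return 0;
  }
  Several copies of the program can be run in parallel inside each
  guest, matching the intent of step 5 above.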
A similar patch survived over 12 hours of testing, where an unpatched
QEMU would fail within 15 minutes.
[Regression Potential]
  The fix is confined to the bottom-half scheduling code in async.c, so
  regression potential should be minimal.
[Original text below]
  I have a strong belief that the guest hangs are caused by the bug
  described in this upstream commit:
  https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg00625.html.
  I have managed to capture a core dump of a qemu process with a hanging
  VM. It shows that qemu is waiting for an I/O request to be completed,
  but that never happens. After some more analysis I can see that the
  request actually is completed, but the completion routine seems never
  to have been called. From the core-dump analysis and the bug
  description in the commit mentioned above, I strongly suspect the
  memory barriers are the problem.
  Unfortunately I have not yet found a good and fast way to prove that
  this is the case. The tricky thing is that the guest needs to dispatch
  a write (or flush) that is a barrier (i.e. no other requests will be
  dispatched until that request is finished), and that exact request
  needs to be the one whose completion is lost. This is hard to
  accomplish.
I have been running a test for a couple of days (~4 days) with a qemu
without the fix and one with the fix included
(https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg00625.html).
  The one without the fix has gotten several "hung_task" messages. The
  other one has not yet received such a message.
[53280.284059] INFO: task flush-253:16:304 blocked for more than 120 seconds.
[53280.285546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53280.287046] flush-253:16 D 0000000000000001 0 304 2 0x00000000
[53280.287058] ffff880138731710 0000000000000046 000000000000000e ffff88013fd13770
[53280.288652] ffff880138731fd8 ffff880138731fd8 ffff880138731fd8 0000000000013700
[53280.290255] ffff880138672de0 ffff880136fc96f0 0000000000000001 ffff88006f1d5cf8
[53280.291772] Call Trace:
[53280.292324] [<ffffffff8165badf>] schedule+0x3f/0x60
[53280.293294] [<ffffffff8165c8c7>] __mutex_lock_slowpath+0xd7/0x150
[53280.294431] [<ffffffff8165c482>] mutex_lock+0x22/0x40
[53280.295379] [<ffffffff8125b23a>] __log_wait_for_space+0xda/0x1b0
[53280.296493] [<ffffffff81256271>] start_this_handle.isra.8+0xf1/0x400
[53280.297681] [<ffffffff8108b8e0>] ? add_wait_queue+0x60/0x60
[53280.298693] [<ffffffff812567d0>] journal_start+0xc0/0x100
[53280.299751] [<ffffffff81209956>] ext3_journal_start_sb+0x36/0x70
[53280.300868] [<ffffffff811f7f68>] ext3_journalled_writepage+0xd8/0x240
[53280.302078] [<ffffffff81121de7>] __writepage+0x17/0x40
[53280.303053] [<ffffffff81122518>] write_cache_pages+0x208/0x460
[53280.304095] [<ffffffff81121dd0>] ? set_page_dirty_lock+0x50/0x50
[53280.305207] [<ffffffff811227ba>] generic_writepages+0x4a/0x70
[53280.306275] [<ffffffff81123725>] do_writepages+0x35/0x40
[53280.307225] [<ffffffff811a2850>] writeback_single_inode+0x170/0x420
[53280.308407] [<ffffffff811a2f16>] writeback_sb_inodes+0x1b6/0x270
[53280.309542] [<ffffffff811a306e>] __writeback_inodes_wb+0x9e/0xd0
[53280.310745] [<ffffffff811a330b>] wb_writeback+0x26b/0x310
[53280.311704] [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
[53280.312756] [<ffffffff811949b2>] ? get_nr_dirty_inodes+0x52/0x80
[53280.313914] [<ffffffff811a344f>] wb_check_old_data_flush+0x9f/0xb0
[53280.315027] [<ffffffff811a4321>] wb_do_writeback+0x151/0x1d0
[53280.316087] [<ffffffff810777b0>] ? usleep_range+0x50/0x50
[53280.317103] [<ffffffff811a4423>] bdi_writeback_thread+0x83/0x280
[53280.318208] [<ffffffff811a43a0>] ? wb_do_writeback+0x1d0/0x1d0
[53280.319277] [<ffffffff8108ae3c>] kthread+0x8c/0xa0
[53280.320213] [<ffffffff81668174>] kernel_thread_helper+0x4/0x10
[53280.321286] [<ffffffff8108adb0>] ? flush_kthread_worker+0xa0/0xa0
[53280.322398] [<ffffffff81668170>] ? gs_change+0x13/0x13
[53280.323369] INFO: task dd:25713 blocked for more than 120 seconds.
[53280.324480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53280.332587] dd D 0000000000000001 0 25713 389 0x00000000
[53280.332591] ffff88011870d918 0000000000000082 ffff88011870d938 ffffffff8105fc08
[53280.334156] ffff88011870dfd8 ffff88011870dfd8 ffff88011870dfd8 0000000000013700
[53280.335696] ffff88013683ade0 ffff880136fc8000 0000000000000002 ffff88006f1d5cf8
[53280.337258] Call Trace:
[53280.337781] [<ffffffff8105fc08>] ? load_balance+0x78/0x370
[53280.338803] [<ffffffff8165badf>] schedule+0x3f/0x60
[53280.339738] [<ffffffff8165c8c7>] __mutex_lock_slowpath+0xd7/0x150
[53280.340868] [<ffffffff8165c482>] mutex_lock+0x22/0x40
[53280.341841] [<ffffffff8125b23a>] __log_wait_for_space+0xda/0x1b0
[53280.342880] [<ffffffff81256271>] start_this_handle.isra.8+0xf1/0x400
[53280.343975] [<ffffffff8108b8e0>] ? add_wait_queue+0x60/0x60
[53280.345023] [<ffffffff812567d0>] journal_start+0xc0/0x100
[53280.346043] [<ffffffff81118ec8>] ? grab_cache_page_write_begin+0x78/0xe0
[53280.347194] [<ffffffff81209956>] ext3_journal_start_sb+0x36/0x70
[53280.348271] [<ffffffff811f9eef>] ext3_write_begin+0x9f/0x270
[53280.349276] [<ffffffff8111839a>] generic_perform_write+0xca/0x210
[53280.350368] [<ffffffff811fbd4d>] ? ext3_dirty_inode+0x6d/0xa0
[53280.351327] [<ffffffff8111853d>] generic_file_buffered_write+0x5d/0x90
[53280.352457] [<ffffffff81119ea9>] __generic_file_aio_write+0x229/0x440
[53280.353582] [<ffffffff8111a132>] generic_file_aio_write+0x72/0xe0
[53280.354708] [<ffffffff81178e0a>] do_sync_write+0xda/0x120
[53280.355702] [<ffffffff812d9588>] ? apparmor_file_permission+0x18/0x20
[53280.356887] [<ffffffff8129ec6c>] ? security_file_permission+0x2c/0xb0
[53280.358092] [<ffffffff811793b1>] ? rw_verify_area+0x61/0xf0
[53280.359149] [<ffffffff81179713>] vfs_write+0xb3/0x180
[53280.360671] [<ffffffff81179a3a>] sys_write+0x4a/0x90
[53280.361613] [<ffffffff81666002>] system_call_fastpath+0x16/0x1b
  Setup:
  Dell R620 machines: 4 compute nodes and 1 CIC. No EMC.
Steps to reproduce
1. Create a flavor with ephemeral storage:
$ nova flavor-create --ephemeral 120 m1.ephemeral auto 4096 50 2
2. Boot cirros VM
$ nova boot --flavor m1.ephemeral --image TestVM --nic net-id=<some net> foobar
3. Log into cirros VM and execute:
$ sudo umount /mnt
$ sudo mkdir /data
$ sudo mount -t tmpfs none /data/
$ sudo dd if=/dev/urandom bs=1M count=100 of=/data/data.bin
$ sudo mkfs.ext3 -b 4096 -J size=4 /dev/vdb
$ sudo mount -o data=journal,barrier=1 /dev/vdb /mnt/
4. Create write.sh:
  #!/bin/sh
  # Each instance writes its own file, named after the shell's PID ($$).
  while true
  do
      dd if=/data/data.bin bs=1M count=100 of=/mnt/$$.tmp 2> /dev/null
  done
5. Start around 20 instances of the script:
Run 20 times:
$ sudo ./write.sh &
6. Log into the compute-node where the cirros VM is running.
7. Create pin-io.py:
  #!/usr/bin/python
  import sys
  import glob
  import os
  import random

  # CPU lists for the two NUMA nodes, passed as comma-separated pcpu numbers.
  numa0_cpus = list(map(int, sys.argv[1].split(",")))
  numa1_cpus = list(map(int, sys.argv[2].split(",")))

  # The qemu main thread and the two vcpu threads are pinned to CPU 0
  # (on NUMA node 0 in the example below); every other thread of the
  # process (the io-threads) is pinned to a random CPU on NUMA node 1.
  pid = int(sys.argv[3])
  exclude = [pid, int(sys.argv[4]), int(sys.argv[5])]
  for tid in exclude:
      os.system("taskset -cp 0 %i" % tid)
  for tid_str in glob.glob("/proc/%i/task/*" % pid):
      tid = int(os.path.basename(tid_str))
      if tid in exclude:
          continue
      cpu = random.choice(numa1_cpus)
      os.system("taskset -cp %i %i" % (cpu, tid))
8. Figure out the pid and the "libvirt name" of the qemu-process that
runs the cirros-image.
9. Get the thread-ids of the vcpu0 and vcpu1 threads
  $ virsh qemu-monitor-command --hmp <libvirt vm-name> "info cpus"
  10. Pin the main-threads to numa-node 0 and the io-threads to numa-node 1:
  $ ./pin-io.py <numa0 pcpus> <numa1 pcpus> <qemu pid> <vcpu0 tid> <vcpu1 tid>
Example for dell-compute:
  $ ./pin-io.py 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 12468 12501 12502
  11. Wait a couple of days.
  12. You should see hung-task messages in the dmesg output of the VM.
  Note that the pin-io step may not be needed, but it should make the bug
  appear more often.
  I think I have reached the end of the line now and that we should
  request the mentioned fix from Mirantis.
/Rickard Enberg
(Ericsson AB)
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1587039/+subscriptions