[Bug 1587039] Re: aio: strengthen memory barriers for bottom half scheduling

Amad Ali amad.ali at canonical.com
Mon May 30 12:29:20 UTC 2016


The bug was introduced upstream by commit
c2e50e3d11a0bf4c973cc30478c1af0f2d5f8e81 (thread-pool: avoid
per-thread-pool EventNotifier).  Until that commit, the code in async.c
was safe because bottom halves were never used across threads.

It was fixed by upstream commit e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36,
released in QEMU 2.3.0:
http://git.qemu.org/?p=qemu.git;a=commit;h=e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36

QEMU 2.2 in the Ubuntu Cloud Archive still has this bug.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1587039

Title:
  aio: strengthen memory barriers for bottom half scheduling

Status in Ubuntu Cloud Archive:
  New

Bug description:
  [Impact]

  There are two problems with memory barriers in async.c.  The fix is
  to use atomic_xchg in order to achieve sequential consistency between
  the scheduling of a bottom half and the corresponding execution.

  First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
  not execute a memory barrier to order any writes needed by the callback
  before the read of bh->scheduled.  If the other side sees req->state as
  THREAD_ACTIVE, the callback is not invoked and you get deadlock.

  Second, the memory barrier in aio_bh_poll is too weak.  Without this
  patch, it is possible that bh->scheduled = 0 is not "published" until
  after the callback has returned.  Another thread wants to schedule the
  bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes
  a lost wakeup.  The memory barrier should have been changed to smp_mb()
  in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
  2014-06-03) together with qemu_bh_schedule()'s. 

  Both of these involve a store and a load, so they are reproducible on
  x86_64 as well.  It is however much easier on aarch64, where the
  libguestfs test suite triggers the bug fairly easily.  Even there the
  failure can go away or appear depending on compiler optimization level,
  tracing options, or even kernel debugging options.
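
  For illustration only, here is a minimal C11 sketch of the idea behind
  the fix.  It is not QEMU's actual code: C11 atomics stand in for QEMU's
  atomic_xchg() wrapper, notify() stands in for aio_notify(), and the
  other names are hypothetical.  The point is that replacing the plain
  load/store of bh->scheduled with a sequentially consistent exchange
  orders the scheduling side against the polling side, so neither the
  callback's data nor a wakeup can be lost.

     /* Hypothetical sketch, not QEMU code. */
     #include <stdatomic.h>

     typedef struct bottom_half {
         void (*cb)(void *opaque);
         void *opaque;
         atomic_int scheduled;
     } bottom_half;

     /* Stand-in for aio_notify(): wake up the event loop. */
     static void notify(void) { /* e.g. write to an eventfd */ }

     static void bh_schedule(bottom_half *bh)
     {
         /* Sequentially consistent exchange: the caller's writes are
          * ordered before the event loop can observe scheduled == 1,
          * even when the bottom half was already scheduled. */
         if (atomic_exchange(&bh->scheduled, 1) == 0) {
             notify();
         }
     }

     static void bh_poll(bottom_half *bh)
     {
         /* Clearing scheduled with an exchange publishes the 0 before
          * the callback runs; a bh_schedule() racing with the callback
          * therefore sees 0, re-notifies, and the wakeup is not lost. */
         if (atomic_exchange(&bh->scheduled, 0) == 1) {
             bh->cb(bh->opaque);
         }
     }

  The upstream fix applies the same idea to qemu_bh_schedule() and
  aio_bh_poll() in async.c using QEMU's atomic_xchg().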

  [Test Case]

  Paul Leveille, however, reported how to trigger the problem within 15
  minutes on x86_64 as well.  His (untested) recipe, reproduced here
  for reference, is the following:

     1) Qcow2 (or 3) is critical – raw files alone seem to avoid the
        problem.

     2) Use "cache=directsync" rather than the default of "cache=none"
        to make it happen more easily (see the example drive line after
        this list).

     3) Use a server with a write-back RAID controller to allow for
        rapid IO rates.

     4) Run a random-access load that (mostly) writes chunks to various
        files on the virtual block device.

        a. I use 'diskload.exe c:25', a Microsoft HCT load generator,
           on Windows VMs.

        b. Iometer can probably be configured to generate a similar
           load.

     5) Run multiple VMs in parallel, against the same storage device,
        to shake the failure out sooner.

     6) IvyBridge and Haswell processors for certain; not sure about
        others.
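
  As a concrete illustration of points 1 and 2 above, a drive configured
  along the following lines should qualify; the image path and the other
  options here are placeholders, not taken from the original report:

     $ qemu-system-x86_64 -m 4096 -smp 2 \
         -drive file=/var/lib/images/test.qcow2,format=qcow2,if=virtio,cache=directsync

  In a libvirt/nova setup the same setting is expressed as
  cache='directsync' on the disk's <driver> element in the domain XML.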

  A similar patch survived over 12 hours of testing, where an unpatched
  QEMU would fail within 15 minutes.

  
  [Regression Potential]

  The fix is small and confined to the bottom-half scheduling code in
  async.c, so the regression potential of this patch should be minimal.

  [Original text below]

  I have a strong belief that the guest hangs are caused by the bug
  described in this upstream commit:
  https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg00625.html.
  I have managed to capture a core dump of a qemu process with a hanging
  VM.  It shows that qemu is waiting for an I/O request to complete, but
  that never happens.  After some more analysis I can see that the
  request actually is completed, but the completion routine seems to
  have never been called.  From the core-dump analysis and the bug
  description in the commit that I mention, I strongly suspect the
  memory barriers are the problem.  Unfortunately I have not yet found a
  good and fast way to prove that this is the case.  The tricky thing is
  that the guest needs to dispatch a write (or flush) that is a barrier
  (i.e. no other requests will be dispatched until that request is
  finished), and that exact request needs to be the one for which the
  completion is lost.  This is hard to accomplish.

  I have been running a test for a couple of days (~4 days) with a qemu
  without the fix and one with the fix included
  (https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg00625.html).
  The one without the fix has gotten several "hung_task" messages; the
  other one has not yet received such a message.

  [53280.284059] INFO: task flush-253:16:304 blocked for more than 120 seconds.
  [53280.285546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [53280.287046] flush-253:16    D 0000000000000001     0   304      2 0x00000000
  [53280.287058]  ffff880138731710 0000000000000046 000000000000000e ffff88013fd13770
  [53280.288652]  ffff880138731fd8 ffff880138731fd8 ffff880138731fd8 0000000000013700
  [53280.290255]  ffff880138672de0 ffff880136fc96f0 0000000000000001 ffff88006f1d5cf8
  [53280.291772] Call Trace:
  [53280.292324]  [<ffffffff8165badf>] schedule+0x3f/0x60
  [53280.293294]  [<ffffffff8165c8c7>] __mutex_lock_slowpath+0xd7/0x150
  [53280.294431]  [<ffffffff8165c482>] mutex_lock+0x22/0x40
  [53280.295379]  [<ffffffff8125b23a>] __log_wait_for_space+0xda/0x1b0
  [53280.296493]  [<ffffffff81256271>] start_this_handle.isra.8+0xf1/0x400
  [53280.297681]  [<ffffffff8108b8e0>] ? add_wait_queue+0x60/0x60
  [53280.298693]  [<ffffffff812567d0>] journal_start+0xc0/0x100
  [53280.299751]  [<ffffffff81209956>] ext3_journal_start_sb+0x36/0x70
  [53280.300868]  [<ffffffff811f7f68>] ext3_journalled_writepage+0xd8/0x240
  [53280.302078]  [<ffffffff81121de7>] __writepage+0x17/0x40
  [53280.303053]  [<ffffffff81122518>] write_cache_pages+0x208/0x460
  [53280.304095]  [<ffffffff81121dd0>] ? set_page_dirty_lock+0x50/0x50
  [53280.305207]  [<ffffffff811227ba>] generic_writepages+0x4a/0x70
  [53280.306275]  [<ffffffff81123725>] do_writepages+0x35/0x40
  [53280.307225]  [<ffffffff811a2850>] writeback_single_inode+0x170/0x420
  [53280.308407]  [<ffffffff811a2f16>] writeback_sb_inodes+0x1b6/0x270
  [53280.309542]  [<ffffffff811a306e>] __writeback_inodes_wb+0x9e/0xd0
  [53280.310745]  [<ffffffff811a330b>] wb_writeback+0x26b/0x310
  [53280.311704]  [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
  [53280.312756]  [<ffffffff811949b2>] ? get_nr_dirty_inodes+0x52/0x80
  [53280.313914]  [<ffffffff811a344f>] wb_check_old_data_flush+0x9f/0xb0
  [53280.315027]  [<ffffffff811a4321>] wb_do_writeback+0x151/0x1d0
  [53280.316087]  [<ffffffff810777b0>] ? usleep_range+0x50/0x50
  [53280.317103]  [<ffffffff811a4423>] bdi_writeback_thread+0x83/0x280
  [53280.318208]  [<ffffffff811a43a0>] ? wb_do_writeback+0x1d0/0x1d0
  [53280.319277]  [<ffffffff8108ae3c>] kthread+0x8c/0xa0
  [53280.320213]  [<ffffffff81668174>] kernel_thread_helper+0x4/0x10
  [53280.321286]  [<ffffffff8108adb0>] ? flush_kthread_worker+0xa0/0xa0
  [53280.322398]  [<ffffffff81668170>] ? gs_change+0x13/0x13

  [53280.323369] INFO: task dd:25713 blocked for more than 120 seconds.
  [53280.324480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [53280.332587] dd              D 0000000000000001     0 25713    389 0x00000000
  [53280.332591]  ffff88011870d918 0000000000000082 ffff88011870d938 ffffffff8105fc08
  [53280.334156]  ffff88011870dfd8 ffff88011870dfd8 ffff88011870dfd8 0000000000013700
  [53280.335696]  ffff88013683ade0 ffff880136fc8000 0000000000000002 ffff88006f1d5cf8
  [53280.337258] Call Trace:
  [53280.337781]  [<ffffffff8105fc08>] ? load_balance+0x78/0x370
  [53280.338803]  [<ffffffff8165badf>] schedule+0x3f/0x60
  [53280.339738]  [<ffffffff8165c8c7>] __mutex_lock_slowpath+0xd7/0x150
  [53280.340868]  [<ffffffff8165c482>] mutex_lock+0x22/0x40
  [53280.341841]  [<ffffffff8125b23a>] __log_wait_for_space+0xda/0x1b0
  [53280.342880]  [<ffffffff81256271>] start_this_handle.isra.8+0xf1/0x400
  [53280.343975]  [<ffffffff8108b8e0>] ? add_wait_queue+0x60/0x60
  [53280.345023]  [<ffffffff812567d0>] journal_start+0xc0/0x100
  [53280.346043]  [<ffffffff81118ec8>] ? grab_cache_page_write_begin+0x78/0xe0
  [53280.347194]  [<ffffffff81209956>] ext3_journal_start_sb+0x36/0x70
  [53280.348271]  [<ffffffff811f9eef>] ext3_write_begin+0x9f/0x270
  [53280.349276]  [<ffffffff8111839a>] generic_perform_write+0xca/0x210
  [53280.350368]  [<ffffffff811fbd4d>] ? ext3_dirty_inode+0x6d/0xa0
  [53280.351327]  [<ffffffff8111853d>] generic_file_buffered_write+0x5d/0x90
  [53280.352457]  [<ffffffff81119ea9>] __generic_file_aio_write+0x229/0x440
  [53280.353582]  [<ffffffff8111a132>] generic_file_aio_write+0x72/0xe0
  [53280.354708]  [<ffffffff81178e0a>] do_sync_write+0xda/0x120
  [53280.355702]  [<ffffffff812d9588>] ? apparmor_file_permission+0x18/0x20
  [53280.356887]  [<ffffffff8129ec6c>] ? security_file_permission+0x2c/0xb0
  [53280.358092]  [<ffffffff811793b1>] ? rw_verify_area+0x61/0xf0
  [53280.359149]  [<ffffffff81179713>] vfs_write+0xb3/0x180
  [53280.360671]  [<ffffffff81179a3a>] sys_write+0x4a/0x90
  [53280.361613]  [<ffffffff81666002>] system_call_fastpath+0x16/0x1b

  Setup: 
  DELL (R620) machine with 4 computes and 1 CIC. No EMC.

  Steps to reproduce

  1. Create a flavor with ephemeral storage:
       $ nova flavor-create --ephemeral 120 m1.ephemeral auto 4096 50 2

  2. Boot cirros VM
       $ nova boot --flavor m1.ephemeral --image TestVM --nic net-id=<some net> foobar

  3. Log into cirros VM and execute:
       $ sudo umount /mnt
       $ sudo mkdir /data
       $ sudo mount -t tmpfs none /data/
       $ sudo dd if=/dev/urandom bs=1M count=100 of=/data/data.bin
       $ sudo mkfs.ext3 -b 4096 -J size=4 /dev/vdb
       $ sudo mount -o data=journal,barrier=1 /dev/vdb /mnt/

  4. Create write.sh:

       #!/bin/sh

       while true
       do
           dd if=/data/data.bin bs=1M count=100 of=/mnt/$$.tmp 2> /dev/null
       done

  5. Start around 20 instances of the script:

      Run 20 times:
      $ sudo ./write.sh &

  6. Log into the compute-node where the cirros VM is running.

  7. Create pin-io.py:

      #!/usr/bin/python

      # Pin the qemu main thread and both vcpu threads to CPU 0, and every
      # other thread of the process (the io-threads) to a random CPU on
      # NUMA node 1.
      #
      # Usage:
      #   pin-io.py <numa0 cpus> <numa1 cpus> <qemu pid> <vcpu0 tid> <vcpu1 tid>

      import sys
      import glob
      import os
      import random

      # Comma-separated CPU lists; numa0_cpus is accepted for symmetry with
      # the command line in step 10, but only CPU 0 is used below.
      numa0_cpus = [int(c) for c in sys.argv[1].split(",")]
      numa1_cpus = [int(c) for c in sys.argv[2].split(",")]

      pid = int(sys.argv[3])

      # The main thread (pid) and the two vcpu threads stay on CPU 0.
      exclude = [pid, int(sys.argv[4]), int(sys.argv[5])]

      for tid in exclude:
          os.system("taskset -cp 0 %i" % tid)

      # Pin every remaining thread of the process to a random NUMA node 1 CPU.
      for tid_str in glob.glob("/proc/%i/task/*" % pid):
          tid = int(os.path.basename(tid_str))

          if tid in exclude:
              continue

          cpu = random.choice(numa1_cpus)

          os.system("taskset -cp %i %i" % (cpu, tid))

  8. Figure out the pid and the "libvirt name" of the qemu-process that
  runs the cirros-image.

  9. Get the thread-ids of the vcpu0 and vcpu1 threads

         $ virsh qemu-monitor-command --hmp <libvirt vm-name> "info cpus"

  10. Pin the main and vcpu threads to NUMA node 0 and the io-threads to
  NUMA node 1:

         $ ./pin-io.py <numa0 pcpus> <numa1 pcpus> <qemu pid> <vcpu0 tid> <vcpu1 tid>

      Example for dell-compute:

         $ ./pin-io.py \
             0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 \
             1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 \
             12468 12501 12502

  11. Wait a couple of days.

  12. You should see hung tasks in dmesg inside the VM.

  Note that the pin-io part may not be needed, but it should make the
  bug appear more often.

  I think I have reached the end of the line now and that we should
  request the mentioned fix from Mirantis.

  /Rickard Enberg 
  (Ericsson AB)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1587039/+subscriptions


