[Bug 1894772] Re: live migration of windows 2012 r2 instance with virtio balloon driver fails from mitaka to queens.

Seyeong Kim 1894772 at bugs.launchpad.net
Thu Sep 10 10:09:44 UTC 2020


** Patch removed: "lp1894772_xenial.debdiff"
   https://bugs.launchpad.net/ubuntu/xenial/+source/qemu/+bug/1894772/+attachment/5408515/+files/lp1894772_xenial.debdiff

** Patch removed: "lp1894772_mitaka.debdiff"
   https://bugs.launchpad.net/ubuntu/xenial/+source/qemu/+bug/1894772/+attachment/5408516/+files/lp1894772_mitaka.debdiff

** Also affects: qemu (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: qemu (Ubuntu Groovy)
   Importance: Undecided
       Status: Fix Released

** Also affects: qemu (Ubuntu Focal)
   Importance: Undecided
       Status: New

** No longer affects: qemu (Ubuntu Xenial)

** Description changed:

  [Impact]
  
  Live migration of a Windows 2012 R2 instance with the virtio balloon
  driver from qemu 2.5 (mitaka) to qemu 2.11 (queens) fails.
  
  It fails in particular when the instance has already been migrated
  before, e.g. 2.5 -> 2.5 -> 2.11.
  
  The second mitaka node then shows the message below.
  
  Migration: [ 94 %]error: internal error: qemu unexpectedly closed the monitor: 2020-09-07T07:45:11.799345Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control
  2020-09-07T07:45:12.765618Z qemu-system-x86_64: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2
  2020-09-07T07:45:12.765642Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
  2020-09-07T07:45:12.765648Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:07.0/virtio-balloon'
  2020-09-07T07:45:12.766483Z qemu-system-x86_64: load of migration failed: Operation not permitted
+ 
+ After the patches for CVE-2016-5403 were applied, a workaround was
+ added as CVE-2015-5403-6.patch (see [Others]).
  
  [Test Case]
  
  Deploy 2 mitaka-staging machines as KVM hosts.
  Deploy 1 queens-staging machine as a KVM host.
  
  Set up an NFS server and clients between them.
  
  Deploy a Windows 2012 R2 guest instance with the virtio balloon driver
  on one of the mitaka hosts.
  
  Migrate it from mitaka to mitaka (this should succeed).
  Migrate it from mitaka to queens (this raises the error).
  
  I can reproduce this issue with bare-metal or VM hosts.
  
  [Regressions]
  As this patch changes qemu, running instances must be restarted to pick up the fix.
  This patch could also cause failures when starting or migrating VMs that use virtio drivers,
  especially Windows guests.
  
  [Others]
  
- I bisected this issue and found one commit below, and the others are
- needed for this.
+ Description: make sure vdev->vq[i].inuse never goes below 0
+  This is a work-around to fix live migrations after the patches for
+  CVE-2016-5403 were applied. The true root cause still needs to be
+  determined.
+ Origin: based on a patch by Len <lwhite at coreitx.com>
+ Bug-Ubuntu: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
  
- ####
- From 4eae2a657d1ff5ada56eb9b4966eae0eff333b0b Mon Sep 17 00:00:00 2001
- From: Ladi Prosek <lprosek at redhat.com>
- Date: Tue, 1 Mar 2016 12:14:03 +0100
- Subject: [PATCH] balloon: fix segfault and harden the stats queue
- 
- The segfault here is triggered by the driver notifying the stats queue
- twice after adding a buffer to it. This effectively resets stats_vq_elem
- back to NULL and QEMU crashes on the next stats timer tick in
- balloon_stats_poll_cb.
- 
- This is a regression introduced in 51b19ebe4320f3dc, although admittedly
- the device assumed too much about the stats queue protocol even before
- that commit. This commit adds a few more checks and ensures that the one
- stats buffer gets deallocated on device reset.
- 
- Cc: qemu-stable at nongnu.org
- Signed-off-by: Ladi Prosek <lprosek at redhat.com>
- Reviewed-by: Michael S. Tsirkin <mst at redhat.com>
- Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
- 
- ####
- From 3eb769fd1cf15f16ca796ab5618efe89b23aa625 Mon Sep 17 00:00:00 2001
- From: Gerd Hoffmann <kraxel at redhat.com>
- Date: Tue, 1 Dec 2015 12:05:14 +0100
- Subject: [PATCH] virtio-gpu: maintain command queue
- 
- We'll go take out the commands we receive out of the virt queue and put
- them into a linked list, to decouple virtio queue handling from actual
- command processing.
- 
- Also move cmd processing to new virtio_gpu_handle_ctrl func, so we can
- easily kick it from different places.
- 
- Signed-off-by: Gerd Hoffmann <kraxel at redhat.com>
- 
- ####
- From 6aa46d8ff1ee7e9ca0c4a54d75c74108bee22124 Mon Sep 17 00:00:00 2001
- From: Paolo Bonzini <pbonzini at redhat.com>
- Date: Sun, 31 Jan 2016 11:28:57 +0100
- Subject: [PATCH] virtio: move VirtQueueElement at the beginning of the structs
- 
- The next patch will make virtqueue_pop/vring_pop allocate memory for
- the VirtQueueElement. In some cases (blk, scsi, gpu) the device wants
- to extend VirtQueueElement with device-specific fields and, until now,
- the place of the VirtQueueElement within the containing struct didn't
- matter. When allocating the entire block in virtqueue_pop/vring_pop,
- however, the containing struct must basically be a "subclass" of
- VirtQueueElement, with the VirtQueueElement as the first field. Make
- that the case for blk and scsi; gpu is already doing it.
- 
- Signed-off-by: Paolo Bonzini <pbonzini at redhat.com>
- Reviewed-by: Michael S. Tsirkin <mst at redhat.com>
- Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
- Reviewed-by: Cornelia Huck <cornelia.huck at de.ibm.com>
- 
- 
- ####
- From 51b19ebe4320f3dcd93cea71235c1219318ddfd2 Mon Sep 17 00:00:00 2001
- From: Paolo Bonzini <pbonzini at redhat.com>
- Date: Thu, 4 Feb 2016 16:26:51 +0200
- Subject: [PATCH] virtio: move allocation to virtqueue_pop/vring_pop
- 
- The return code of virtqueue_pop/vring_pop is unused except to check for
- errors or 0.  We can thus easily move allocation inside the functions
- and just return a pointer to the VirtQueueElement.
- 
- The advantage is that we will be able to allocate only the space that
- is needed for the actual size of the s/g list instead of the full
- VIRTQUEUE_MAX_SIZE items.  Currently VirtQueueElement takes about 48K
- of memory, and this kind of allocation puts a lot of stress on malloc.
- By cutting the size by two or three orders of magnitude, malloc can
- use much more efficient algorithms.
- 
- The patch is pretty large, but changes to each device are testable
- more or less independently.  Splitting it would mostly add churn.
- 
- Signed-off-by: Paolo Bonzini <pbonzini at redhat.com>
- Reviewed-by: Michael S. Tsirkin <mst at redhat.com>
- Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
- Reviewed-by: Cornelia Huck <cornelia.huck at de.ibm.com>
+ Index: qemu-2.5+dfsg/hw/virtio/virtio.c
+ ===================================================================
+ --- qemu-2.5+dfsg.orig/hw/virtio/virtio.c       2017-04-05 09:48:17.420025137 -0400
+ +++ qemu-2.5+dfsg/hw/virtio/virtio.c    2017-04-05 09:49:59.565337543 -0400
+ @@ -1510,6 +1510,7 @@
+      for (i = 0; i < num; i++) {
+          if (vdev->vq[i].vring.desc) {
+              uint16_t nheads;
+ +            int inuse_tmp;
+              nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
+              /* Check it isn't doing strange things with descriptor numbers. */
+              if (nheads > vdev->vq[i].vring.num) {
+ @@ -1527,12 +1528,15 @@
+               * Since max ring size < UINT16_MAX it's safe to use modulo
+               * UINT16_MAX + 1 subtraction.
+               */
+ -            vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx -
+ +            inuse_tmp = (int)(vdev->vq[i].last_avail_idx -
+                                  vring_used_idx(&vdev->vq[i]));
+ +
+ +            vdev->vq[i].inuse = (inuse_tmp < 0 ? 0 : inuse_tmp);
+ +
+              if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
+ -                error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
+ +                error_report("VQ %d inuse %u size 0x%x < last_avail_idx 0x%x - "
+                               "used_idx 0x%x",
+ -                             i, vdev->vq[i].vring.num,
+ +                             i, vdev->vq[i].inuse, vdev->vq[i].vring.num,
+                               vdev->vq[i].last_avail_idx,
+                               vring_used_idx(&vdev->vq[i]));
+                  return -1;

** Patch added: "lp1894772_queens.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1894772/+attachment/5409317/+files/lp1894772_queens.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1894772

Title:
  live migration of windows 2012 r2 instance with virtio balloon driver
  fails from mitaka to queens.

Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Focal:
  In Progress
Status in qemu source package in Groovy:
  Fix Released

Bug description:
  [Impact]

  Live migration of a Windows 2012 R2 instance with the virtio balloon
  driver from qemu 2.5 (mitaka) to qemu 2.11 (queens) fails.

  It fails in particular when the instance has already been migrated
  before, e.g. 2.5 -> 2.5 -> 2.11.

  The second mitaka node then shows the message below.

  Migration: [ 94 %]error: internal error: qemu unexpectedly closed the monitor: 2020-09-07T07:45:11.799345Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control
  2020-09-07T07:45:12.765618Z qemu-system-x86_64: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2
  2020-09-07T07:45:12.765642Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
  2020-09-07T07:45:12.765648Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:07.0/virtio-balloon'
  2020-09-07T07:45:12.766483Z qemu-system-x86_64: load of migration failed: Operation not permitted
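
  The "VQ 2" line in the log can be reproduced with plain 16-bit
  arithmetic. The sketch below is a standalone illustration, not QEMU
  code: the helper names inuse_unpatched, check_inuse and
  replay_logged_values are made up here, and the expression is modelled
  on the line that the patch under [Others] removes. With last_avail_idx
  0x1 and used_idx 0x2 the difference wraps to 0xffff, which is larger
  than the ring size 0x80, so the state load is rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Modulo-2^16 difference, as in the (uint16_t) cast in the unpatched code.
 * When used_idx is ahead of last_avail_idx the result wraps to a large value. */
static uint16_t inuse_unpatched(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/* Mirrors the sanity check: -1 if the in-flight count exceeds the ring
 * size, 0 otherwise. */
static int check_inuse(uint16_t inuse, unsigned int ring_size)
{
    return inuse > ring_size ? -1 : 0;
}

/* Replays the values from the log line:
 * VQ size 0x80, last_avail_idx 0x1, used_idx 0x2. */
void replay_logged_values(void)
{
    uint16_t inuse = inuse_unpatched(0x1, 0x2);

    printf("inuse = 0x%x, ring size = 0x%x\n", (unsigned)inuse, 0x80u);
    assert(inuse == 0xFFFF);                /* 1 - 2 wraps around */
    assert(check_inuse(inuse, 0x80) == -1); /* load would be rejected */
}
```

  Calling replay_logged_values() from a small main() is enough to see the
  wrapped value.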

  After the patches for CVE-2016-5403 were applied, a workaround was
  added as CVE-2015-5403-6.patch (see [Others]).

  [Test Case]

  Deploy 2 mitaka-staging machines as KVM hosts.
  Deploy 1 queens-staging machine as a KVM host.

  Set up an NFS server and clients between them.

  Deploy a Windows 2012 R2 guest instance with the virtio balloon driver
  on one of the mitaka hosts.

  Migrate it from mitaka to mitaka (this should succeed).
  Migrate it from mitaka to queens (this raises the error).

  I can reproduce this issue with bare-metal or VM hosts.

  [Regressions]
  As this patch changes qemu, running instances must be restarted to pick up the fix.
  This patch could also cause failures when starting or migrating VMs that use virtio drivers,
  especially Windows guests.

  [Others]

  Description: make sure vdev->vq[i].inuse never goes below 0
   This is a work-around to fix live migrations after the patches for
   CVE-2016-5403 were applied. The true root cause still needs to be
   determined.
  Origin: based on a patch by Len <lwhite at coreitx.com>
  Bug-Ubuntu: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389

  Index: qemu-2.5+dfsg/hw/virtio/virtio.c
  ===================================================================
  --- qemu-2.5+dfsg.orig/hw/virtio/virtio.c       2017-04-05 09:48:17.420025137 -0400
  +++ qemu-2.5+dfsg/hw/virtio/virtio.c    2017-04-05 09:49:59.565337543 -0400
  @@ -1510,6 +1510,7 @@
       for (i = 0; i < num; i++) {
           if (vdev->vq[i].vring.desc) {
               uint16_t nheads;
  +            int inuse_tmp;
               nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
               /* Check it isn't doing strange things with descriptor numbers. */
               if (nheads > vdev->vq[i].vring.num) {
  @@ -1527,12 +1528,15 @@
                * Since max ring size < UINT16_MAX it's safe to use modulo
                * UINT16_MAX + 1 subtraction.
                */
  -            vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx -
  +            inuse_tmp = (int)(vdev->vq[i].last_avail_idx -
                                   vring_used_idx(&vdev->vq[i]));
  +
  +            vdev->vq[i].inuse = (inuse_tmp < 0 ? 0 : inuse_tmp);
  +
               if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
  -                error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
  +                error_report("VQ %d inuse %u size 0x%x < last_avail_idx 0x%x - "
                                "used_idx 0x%x",
  -                             i, vdev->vq[i].vring.num,
  +                             i, vdev->vq[i].inuse, vdev->vq[i].vring.num,
                                vdev->vq[i].last_avail_idx,
                                vring_used_idx(&vdev->vq[i]));
                   return -1;
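
  The effect of the work-around can be tried outside QEMU. This is a
  minimal standalone sketch, assuming int is wider than 16 bits and that
  both indices are uint16_t, as the cast in the removed line suggests;
  inuse_patched is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Patched computation: the uint16_t operands are promoted to int, so the
 * difference can become negative; negative values are clamped to 0. */
int inuse_patched(uint16_t last_avail_idx, uint16_t used_idx)
{
    int inuse_tmp = (int)(last_avail_idx - used_idx);

    return inuse_tmp < 0 ? 0 : inuse_tmp;
}
```

  For the logged values (last_avail_idx 0x1, used_idx 0x2) this returns 0,
  so the size check passes. The clamp also turns an ordinary index wrap
  such as last_avail_idx 0x0001 with used_idx 0xffff into 0, where the
  modulo form would give 2, so it is best read as a work-around only, as
  the patch description says.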

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1894772/+subscriptions


