please cherrypick bugfix for Ubuntu 18.04 - frequent Xorg crash after suspend

Alan Jenkins alan.christopher.jenkins at gmail.com
Thu Jun 14 10:49:49 UTC 2018


Hi

Please cherry-pick commit "block: do not use interruptible wait 
anywhere" 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 for your 4.15 kernels.

I described it in this bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776887
The text below is copied from that report, for your convenience.

Regards

Alan

---

This upstream bug has been confirmed to affect Ubuntu users[1]. As per 
the fix commit (below), the most frequent symptom is a crash of 
Xorg/Xwayland, i.e. killing the entire GUI, when a laptop is woken from 
system sleep. Frequency of the bug is described as once every few days[2].

[1] E.g. this user confirms the bug & very specific workaround: 
https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/1760450/comments/11 
[2] E.g. this log of crashes: 
https://bugzilla.redhat.com/show_bug.cgi?id=1553979#c23 

This is a bug in blk-core.c. It is not specific to any one hardware 
driver. Technically the suspend bug is triggered by the SCSI core - 
which is used by *all SATA devices*.

The commit also includes a test which quickly and reliably proves the 
existence of a horrifying bug.

I guess you might avoid this bug only if you have root on NVMe. The 
other way to not hit the Xorg crash is if you don't use all your RAM, so 
there's no pressure that leads to cold pages of Xorg being swapped. 
Also, you won't reproduce the Xorg crash if you suspend and then resume 
immediately. (This frustrated my tests at one point: the crash only 
triggered after I left the system suspended over lunch.)
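
To judge whether a given machine is likely exposed, the conditions above can be checked roughly as follows. This is only a sketch: the helper names classify_disk and report_exposure are made up for illustration, and it assumes the usual naming convention where nvme* devices bypass the SCSI core while sd* devices go through it. Roots on LVM or dm-crypt will show up as "unknown" and need a manual look.

```shell
#!/bin/sh
# Classify a block device by its kernel name.
# Assumption: nvme* does not use the SCSI core; sd* (SATA, SAS, USB) does.
classify_disk() {
    case "$1" in
        /dev/nvme*|nvme*) echo "nvme" ;;
        /dev/sd*|sd*)     echo "scsi" ;;
        *)                echo "unknown" ;;
    esac
}

# Print the device backing / and whether any swap area is active.
report_exposure() {
    root_dev=$(findmnt -n -o SOURCE / 2>/dev/null)
    echo "root device: ${root_dev:-unknown} ($(classify_disk "$root_dev"))"
    # /proc/swaps has one header line; any further line is an active swap area.
    swap_lines=$(grep -c . /proc/swaps 2>/dev/null)
    if [ "${swap_lines:-0}" -gt 1 ]; then
        echo "swap: active"
    else
        echo "swap: none"
    fi
}
```

Calling report_exposure on a laptop with root on a SATA disk and active swap would be the combination described above as most likely to hit the crash.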

Fix: "block: do not use interruptible wait anywhere"

in kernel 4.17: 
https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 

in kernel 4.16.8: 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.16.y&id=7859056bc73dea2c3714b00c83b253d4c22bf7b6 

Fix still missing in 4.15.0-24.26 (Ubuntu 18.04): 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/block/blk-core.c?id=Ubuntu-4.15.0-24.26#n856
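
For anyone wanting to check another kernel tree the same way, a rough sketch follows. The function name check_blk_core is hypothetical, and it assumes that any remaining wait_event_interruptible call in block/blk-core.c means the fix has not been applied; a tree that has diverged from upstream may need a closer look at blk_queue_enter() itself.

```shell
#!/bin/sh
# Report whether a blk-core.c file still contains an interruptible wait.
# Assumption: any wait_event_interruptible call left in this file means
# the fix "block: do not use interruptible wait anywhere" is missing.
check_blk_core() {
    if [ ! -r "$1" ]; then
        echo "unreadable"
        return 2
    fi
    if grep -q 'wait_event_interruptible' "$1"; then
        echo "unfixed"
        return 1
    fi
    echo "fixed"
}
```

Run from the top of a checked-out kernel tree it would be used as: check_blk_core block/blk-core.c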

---

Patch included inline for reference.  Almost certainly 
whitespace-damaged, as I'm sending with Thunderbird.

 From 899f1a4ba5e6634f7ba971cf495b749c898433a0 Mon Sep 17 00:00:00 2001
From: Alan Jenkins <alan.christopher.jenkins at gmail.com>
Date: Thu, 12 Apr 2018 15:47:36 +0100
Subject: [PATCH v2] block: do not use interruptible wait anywhere

When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.

So that commit exposed the bug as a regression in v4.15.  A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed.  Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]

[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Without this fix, I get an IO error in this test:

# dd if=/dev/sda of=/dev/null iflag=direct & \
   while killall -SIGUSR1 dd; do sleep 0.1; done & \
   echo mem > /sys/power/state ; \
   sleep 5; killall dd  # stop after 5 seconds

The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.

Reviewed-by: Bart Van Assche <bart.vanassche at wdc.com>
Cc: stable at vger.kernel.org
Signed-off-by: Alan Jenkins <alan.christopher.jenkins at gmail.com>
---
v2: fix indentation

  block/blk-core.c | 11 ++++-------
  1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb8684ba67..1a762f3980f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -915,7 +915,6 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
  
  	while (true) {
  		bool success = false;
-		int ret;
  
  		rcu_read_lock();
  		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
@@ -947,14 +946,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
  		 */
  		smp_rmb();
  
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				(atomic_read(&q->mq_freeze_depth) == 0 &&
-				 (preempt || !blk_queue_preempt_only(q))) ||
-				blk_queue_dying(q));
+		wait_event(q->mq_freeze_wq,
+			   (atomic_read(&q->mq_freeze_depth) == 0 &&
+			    (preempt || !blk_queue_preempt_only(q))) ||
+			   blk_queue_dying(q));
  		if (blk_queue_dying(q))
  			return -ENODEV;
-		if (ret)
-			return ret;
  	}
  }




