please cherrypick bugfix for Ubuntu 18.04 - frequent Xorg crash after suspend

Thu Jun 14 14:43:16 UTC 2018

On 06/14/2018 06:49 AM, Alan Jenkins wrote:
> Hi
>
> Please cherry-pick commit "block: do not use interruptible wait
> anywhere" 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 for your 4.15 kernels.
>
> I described it in
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776887  The text
> below is just a copy+paste from my bug report, for your convenience.
>
> Regards
>
> Alan
>
> ---
>
> This upstream bug has been confirmed to affect Ubuntu users[1]. As per
> the fix commit (below), the most frequent symptom is a crash of
> Xorg/Xwayland, i.e. killing the entire GUI, when a laptop is woken
> from system sleep. Frequency of the bug is described as once every few
> days[2].
>
> [1] E.g. this user confirms the bug & very specific workaround:
> https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/1760450/comments/11
>
> [2] E.g. this log of crashes:
> https://bugzilla.redhat.com/show_bug.cgi?id=1553979#c23
>
> This is a bug in blk-core.c. It is not specific to any one hardware
> driver. Technically the suspend bug is triggered by the SCSI core -
> which is used by *all SATA devices*.
>
> The commit also includes a test which quickly and reliably proves the
> existence of a horrifying bug.
>
> I guess you might avoid this bug only if you have root on NVMe. The
> other way to not hit the Xorg crash is if you don't use all your RAM,
> so there's no pressure that leads to cold pages of Xorg being swapped.
> Also, you won't reproduce the Xorg crash if you suspend+resume
> immediately. (This frustrated my tests at one point; it only triggered
> after I left the system suspended over lunch :).
>
> Fix: "block: do not use interruptible wait anywhere"
>
> in kernel 4.17:
> https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428
>
> in kernel 4.16.8:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.16.y&id=7859056bc73dea2c3714b00c83b253d4c22bf7b6
>
> lack of fix in 4.15.0-24.26 (ubuntu 18.04):
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/block/blk-core.c?id=Ubuntu-4.15.0-24.26#n856
>
> ---
>
> Patch included inline for reference.  Almost certainly
> whitespace-damaged, as I'm sending with Thunderbird.
>
> From 899f1a4ba5e6634f7ba971cf495b749c898433a0 Mon Sep 17 00:00:00 2001
> From: Alan Jenkins <alan.christopher.jenkins at gmail.com>
> Date: Thu, 12 Apr 2018 15:47:36 +0100
> Subject: [PATCH v2] block: do not use interruptible wait anywhere
>
> When blk_queue_enter() waits for a queue to unfreeze, or unset the
> PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
>
> The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
> ("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
> device is resumed asynchronously, i.e. after un-freezing userspace tasks.
>
> So that commit exposed the bug as a regression in v4.15.  A mysterious
> SIGBUS (or -EIO) sometimes happened during the time the device was being
> resumed.  Most frequently, there was no kernel log message, and we saw Xorg
> or Xwayland killed by SIGBUS.[1]
>
> [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979
>
> Without this fix, I get an IO error in this test:
>
> # dd if=/dev/sda of=/dev/null iflag=direct & \
>   while killall -SIGUSR1 dd; do sleep 0.1; done & \
>   echo mem > /sys/power/state ; \
>   sleep 5; killall dd  # stop after 5 seconds
>
> The interruptible wait was added to blk_queue_enter in
> commit 3ef28e83ab15 ("block: generic request_queue reference counting").
> Before then, the interruptible wait was only in blk-mq, but I don't think
> it could ever have been correct.
>
> Reviewed-by: Bart Van Assche <bart.vanassche at wdc.com>
> Cc: stable at vger.kernel.org
> Signed-off-by: Alan Jenkins <alan.christopher.jenkins at gmail.com>
> ---
> v2: fix indentation
>
>  block/blk-core.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index abcb8684ba67..1a762f3980f2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -915,7 +915,6 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  
>      while (true) {
>          bool success = false;
> -        int ret;
>  
>          rcu_read_lock();
>          if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> @@ -947,14 +946,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>           */
>          smp_rmb();
>  
> -        ret = wait_event_interruptible(q->mq_freeze_wq,
> -                (atomic_read(&q->mq_freeze_depth) == 0 &&
> -                 (preempt || !blk_queue_preempt_only(q))) ||
> -                blk_queue_dying(q));
> +        wait_event(q->mq_freeze_wq,
> +               (atomic_read(&q->mq_freeze_depth) == 0 &&
> +                (preempt || !blk_queue_preempt_only(q))) ||
> +               blk_queue_dying(q));
>          if (blk_queue_dying(q))
>              return -ENODEV;
> -        if (ret)
> -            return ret;
>      }
>  }
>
>
Hi Alan,

I just assigned the bug to myself.  I'll take a look and should have a
test kernel posted to the bug shortly.

Thanks,

Joe
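
For anyone checking a kernel tree by hand, the following is a minimal
sketch of a check for the interruptible wait that the quoted patch
removes.  The function name check_blk_core and its "present"/"absent"
output are made up for illustration; the only thing taken from the patch
is the string wait_event_interruptible(q->mq_freeze_wq.  A result of
"absent" only means that this exact call was not found in the file; it
does not prove the tree carries the fix.

```shell
#!/bin/sh
# Hypothetical helper (not part of any kernel tooling): report whether a
# copy of block/blk-core.c still contains the interruptible wait on
# q->mq_freeze_wq that the quoted patch replaces with wait_event().
check_blk_core() {
    file=$1

    # Refuse to guess if the file cannot be read.
    if [ ! -r "$file" ]; then
        echo "unreadable"
        return 2
    fi

    # -F: match the text literally; -q: no output, exit status only.
    if grep -qF 'wait_event_interruptible(q->mq_freeze_wq' "$file"; then
        echo "present"
        return 1
    fi

    echo "absent"
    return 0
}
```

Example use, run from the top of a checked-out kernel tree:

    check_blk_core block/blk-core.c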