ACK: [SRU][F][PATCH 0/1] xfs: Preallocated ioend transactions cause deadlock due to log buffer exhaustion
Khaled Elmously
khalid.elmously at canonical.com
Tue Feb 14 07:14:57 UTC 2023
On 2023-02-14 18:50:21 , Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2007219
>
> [Impact]
>
> A deadlock exists in the XFS filesystem that occurs when the XFS log buffer
> becomes completely exhausted.
>
> XFS maintains a circular ring buffer for the log, and it is a fixed size. To be
> able to create a transaction, you need to be able to reserve space on the log
> buffer.
>
> Certain ioend transactions, such as file append, can be preallocated for a
> negligible performance gain. This takes up space in the log buffer, and these
> preallocated ioends are placed on a workqueue, behind other ioends that are not
> preallocated.
>
> The deadlock occurs when the XFS log buffer is running low on space, and an
> ioend append transaction comes in. It is preallocated, consuming nearly all of
> the remaining XFS log buffer space, and is placed at the very end of the ioend
> workqueue. The kernel then takes a ioend from the top of the ioend
> workqeueue, creates a transaction, xfs_trans_alloc(), attempts to allocate space
> for it, xfs_trans_reserve(), xfs_log_reserve(), and then waits in a while loop
> checking free space in the log, xlog_grant_head_check(), xlog_grant_head_wait().
>
> Since there is no space, the thread sleeps with schedule(). This happens with
> all ioend transactions, since the log is exhausted and I/O is not moving.
>
> Since I/O never moves, the thread is never woken up, and we get hung task
> timeouts, with system failure shortly afterward.
>
> An example hung task timeout is:
>
> INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
> Tainted: G OE 5.4.0-137-generic #154-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/60:0 D 0 4002982 2 0x90004080
> Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
> Call Trace:
> __schedule+0x2e3/0x740
> schedule+0x42/0xb0
> xlog_grant_head_wait+0xb9/0x1e0 [xfs]
> xlog_grant_head_check+0xde/0x100 [xfs]
> xfs_log_reserve+0xc9/0x1e0 [xfs]
> xfs_trans_reserve+0x17a/0x1e0 [xfs]
> xfs_trans_alloc+0xda/0x170 [xfs]
> xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
> xfs_end_ioend+0x15b/0x1b0 [xfs]
> xfs_end_io+0xb1/0xe0 [xfs]
> process_one_work+0x1eb/0x3b0
> worker_thread+0x4d/0x400
> kthread+0x104/0x140
> ? process_one_work+0x3b0/0x3b0
> ? kthread_park+0x90/0x90
> ret_from_fork+0x1f/0x40
>
> There is no known workaround, other than to have a larger log buffer at
> filesystem creation, but even then, it only buys you time until you get high
> enough load to exhaust the log buffer.
>
> [Fix]
>
> This was fixed in 5.13-rc1 by the following patch:
>
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster at redhat.com>
> Date: Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
>
> The patch more or less removes all preallocated ioend transactions, and instead,
> when ioend appends are needed, they go to the standard workqueue like any other
> ioend, where transactions are allocated when they reach the top of the
> workqueue.
>
> The patch required some backporting for Focal. The changes to the patch itself
> is minimal and should be straightforward to read, however, the changes to the
> XFS ioend subsystem between 5.4 and 5.13-rc1 were quite extreme, with a lot of
> refactoring taking place over very many commits.
>
> Additionally, the patch was part of a five part series, the first, fixes the
> deadlock, and the rest remove all the code to do with transaction preallocation.
>
> It is safe to leave the rest of the code in place. It will become dead code, but
> it will not be reachable, and not cause any risk of regression, due to
> ioend->io_append_trans always being NULL, and not entering into any of the if
> statements.
>
> [Testcase]
>
> There is currently no known testcase for this issue. The issue has only been
> seen in a customer production environment, running under heavy load. The issue
> has not been seen in a customer test environment, only production.
>
> The production workload is a busy Kubernetes cluster running containers and VMs,
> with the hosts's filesystem being broken into separate mountpoints over several
> partitions, all XFS.
>
> The kubernetes containers are all backed from a large 4TB disk which is XFS.
>
> A test kernel is available in the following ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test
>
> This test kernel has been deployed to several production hosts, and the deadlock
> no longer occurs.
>
> [Where problems can occur]
>
> We are changing how certain ioend transactions take place in the XFS filesystem.
> If a regression were to occur, it would impact all XFS users. Users would have
> to downgrade their kernel, as there are no workarounds for enabling or disabling
> the behaviour change.
>
> The overall risk of the change should be low. ioend append transactions would
> still be processed in nearly the same way, still being placed at the end of the
> ioend workqueue like any other transaction, with the only change being it is
> allocated at the time the transaction is created, at the top of the workqueue
> when it is processed, and not preallocated when the ioend is first submitted.
>
> There will be a very minor performance penalty, but it wouldn't be measurable
> in any tangible workload.
>
> I have run xfstests for the xfs/* subset against the released 5.4.0-137-generic
> and the test 5.4.0-137-generic test kernel with the patch included. The test
> kernel had identical results, it will likely not cause any regressions.
>
> There is some additional risk leaving the dead code in place, but I have read
> the code and the commits to remove the dead code, and I came to the conclusion
> it is less risky to leave it in place, than to backport the refactor commits.
>
> [Other Info]
>
> The XFS ioend subsystem refactor took place in the following commits:
>
> commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
> From: Christoph Hellwig <hch at lst.de>
> Date: Thu, 17 Oct 2019 13:12:15 -0700
> Subject: iomap: lift the xfs writeback code to iomap
> Link: https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63
>
> commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
> From: Christoph Hellwig <hch at lst.de>
> Date: Thu, 17 Oct 2019 13:12:13 -0700
> Subject: iomap: lift common tracing code from xfs to iomap
> Link: https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77
>
> commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
> From: Christoph Hellwig <hch at lst.de>
> Date: Thu, 17 Oct 2019 13:12:07 -0700
> Subject: xfs: refactor the ioend merging code
> Link: https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289
>
> ioend->io_append_trans was renamed to ioend->io_private in the following commit:
>
> commit 5653017bc44e54baa299f3523f160c23ac0628fd
> From: Christoph Hellwig <hch at lst.de>
> Date: Thu, 17 Oct 2019 13:12:09 -0700
> Subject: xfs: turn io_append_trans into an io_private void pointer
> Link: https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd
>
> The full five part preallocated ioend deadlock patch series is:
>
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster at redhat.com>
> Date: Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
>
> commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
> Author: Brian Foster <bfoster at redhat.com>
> Date: Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: open code ioend needs workqueue helper
> Link: https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f
>
> commit 044c6449f18f174ba8d86640936add3fc7582e49
> Author: Brian Foster <bfoster at redhat.com>
> Date: Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: drop unused ioend private merge and setfilesize code
> Link: https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
>
> commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> Author: Brian Foster <bfoster at redhat.com>
> Date: Fri Apr 9 10:27:56 2021 -0700
> Subject: xfs: drop unnecessary setfilesize helper
> Link: https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
>
> commit 6e552494fb90acae005d74ce6a2ee102d965184b
> Author: Brian Foster <bfoster at redhat.com>
> Date: Tue May 4 08:54:29 2021 -0700
> Subject: iomap: remove unused private field from ioend
> Link: https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
>
> As you can see, the four latter commits are not necessary. They simply remove
> dead code, which has no harm being left in place. They also do not backport
> at all, not without the ALL of the significant refactor commits.
>
> Hence, we only take "xfs: drop submit side trans alloc for append ioends" only,
> as it is the only needed commit to solve the problem.
>
> Brian Foster (1):
> xfs: drop submit side trans alloc for append ioends
>
> fs/xfs/xfs_aops.c | 43 +++----------------------------------------
> 1 file changed, 3 insertions(+), 40 deletions(-)
>
Acked-by: Khalid Elmously <khalid.elmously at canonical.com>
Also, I believe the convention we follow is "backported from commit xxxxx" not "backported from xxxxx".
> --
> 2.37.2
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
More information about the kernel-team
mailing list