APPLIED: [SRU][Noble][PATCH 0/3] mm/folios: xfs hangs with hung task timeouts with corrupted folio pointer lists
Stefan Bader
stefan.bader at canonical.com
Fri Oct 25 14:41:02 UTC 2024
On 24.10.24 10:43, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2085495
>
> [Impact]
>
> A long running, and incredibly difficult to reproduce large folio issue leads to
> hung task timeouts in the xfs subsystem with the following stack trace:
>
> CPU: 0 PID: 226487 Comm: xfs_io Tainted: G L 6.5.0-41-generic #41~22.04.2-Ubuntu
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:xas_descend+0x25/0xd0
> Code: 90 90 90 90 90 55 48 89 e5 41 56 41 55 49 89 fd 41 54 49 89 f4 53 48 83 ec 08 0f b6 0e 48 8b 5f 08 80 f9 3f 0f 87 5d 2f 07 00 <48> d3 eb 83 e3 3f 89 d8 48 83 c0 04 49 8b 44 c4 08 4d 89 65 18 48
> RSP: 0018:ffffaf9b44927a68 EFLAGS: 00000293
> RAX: ffff8d61568f36d2 RBX: 00000000000005c0 RCX: 0000000000000006
> RDX: 0000000000000002 RSI: ffff8d61568f36d0 RDI: ffffaf9b44927b10
> RBP: ffffaf9b44927a90 R08: 0000000000000000 R09: 0000000000000000
> R10: ffff8d6159120938 R11: 0000000000000000 R12: ffff8d61568f36d0
> R13: ffffaf9b44927b10 R14: ffffaf9b44927e30 R15: ffffaf9b44927e08
> FS: 00007bcf4ce2c840(0000) GS:ffff8d61be400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007bcf4ca80df8 CR3: 0000000008524005 CR4: 0000000000370ef0
> Call Trace:
> <IRQ>
> ? show_regs+0x6d/0x80
> ? watchdog_timer_fn+0x1d8/0x240
> ? __pfx_watchdog_timer_fn+0x10/0x10
> ? __hrtimer_run_queues+0x10f/0x2a0
> ? kvm_clock_get_cycles+0x18/0x40
> ? hrtimer_interrupt+0xf6/0x250
> ? __sysvec_apic_timer_interrupt+0x5f/0x140
> ? sysvec_apic_timer_interrupt+0x8d/0xd0
> </IRQ>
> <TASK>
> ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> ? xas_descend+0x25/0xd0
> xas_load+0x4c/0x60
> __xas_next+0xa9/0x150
> filemap_get_read_batch+0x1a3/0x2e0
> filemap_get_pages+0xa9/0x3b0
> ? touch_atime+0x44/0x1c0
> filemap_read+0xe7/0x430
> generic_file_read_iter+0xbb/0x110
> ? down_read+0x12/0xc0
> xfs_file_buffered_read+0x57/0xe0 [xfs]
> xfs_file_read_iter+0xb6/0x1c0 [xfs]
> ? security_file_permission+0x5f/0x70
> vfs_read+0x20a/0x360
> __x64_sys_pread64+0xa6/0xd0
> x64_sys_call+0x1e01/0x20b0
> do_syscall_64+0x55/0x90
> ? do_syscall_64+0x61/0x90
> ? syscall_exit_to_user_mode+0x37/0x60
> ? do_syscall_64+0x61/0x90
> entry_SYSCALL_64_after_hwframe+0x73/0xdd
> RIP: 0033:0x7bcf4cd1278f
> Code: 08 89 3c 24 48 89 4c 24 18 e8 7d e2 f7 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 bd e2 f7 ff 48 8b
> RSP: 002b:00007fff220ed560 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
> RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007bcf4cd1278f
> RDX: 0000000000010000 RSI: 0000623b5c5f2000 RDI: 0000000000000003
> RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000005c0000 R11: 0000000000000293 R12: 00000000005c0000
> R13: 000000001fa40000 R14: 00000000005c0000 R15: 0000000000000000
> </TASK>
> watchdog: BUG: soft lockup - CPU#1 stuck for 417s! [xfs_io:226486]
>
> The transaction never recovers, and the system must be force restarted. Doing
> this can lose data not yet written to disk.
>
> There is no workaround, other than to build your kernel disabling large folio
> support for xfs.
>
> [Fix]
>
> The below patches fix the issue by more-or-less calling xas_reset() after
> xas_split_alloc(), which ensures the folio pointer list doesn't get corrupted
> if a race condition occurs.
>
> commit a4864671ca0bf51c8e78242951741df52c06766f
> Author: Kairui Song <kasong at tencent.com>
> Date: Tue Apr 16 01:18:55 2024 +0800
> Subject: lib/xarray: introduce a new helper xas_get_order
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4864671ca0bf51c8e78242951741df52c06766f
>
> commit de60fd8ddeda2b41fbe11df11733838c5f684616
> Author: Kairui Song <kasong at tencent.com>
> Date: Tue Apr 16 01:18:53 2024 +0800
> Subject: mm/filemap: return early if failed to allocate memory for split
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de60fd8ddeda2b41fbe11df11733838c5f684616
>
> commit 6758c1128ceb45d1a35298912b974eb4895b7dd9
> Author: Kairui Song <kasong at tencent.com>
> Date: Tue Apr 16 01:18:56 2024 +0800
> Subject: mm/filemap: optimize filemap folio adding
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6758c1128ceb45d1a35298912b974eb4895b7dd9
>
> These all landed in 6.10-rc1. For Noble, we will use the versions from upstream
> -stable 6.6.y directly from Greg KH. They contain minor backports from the
> mainline variants, but cherry pick directly to 6.8 noble.
>
> Only 6.1 or later is affected, so only noble needs the patches.
>
> [Testcase]
>
> You will need a disk attached to your VM, or a bare metal system with multiple
> disks.
>
> $ sudo lsblk
> vdc 253:32 0 10G 0 disk
> $ sudo mkfs.xfs /dev/vdc
>
> $ cat >> xfsfallout.bash << EOF
> #!/bin/bash
> sudo mount /dev/vdc /mnt
> for x in {0..8}; do sudo fallocate -l100m /mnt/file${x}; sudo ./reader /mnt/file${x} & done
> EOF
>
> $ cat >> reader.c << EOF
> /*
> * gcc -Wall -o reader reader.c -lpthread
> */
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/mman.h>
> #include <sys/sendfile.h>
> #include <unistd.h>
> #include <errno.h>
> #include <err.h>
> #include <pthread.h>
>
> struct thread_data {
> int fd;
> size_t size;
> };
>
> static void *drop_pages(void *arg)
> {
> struct thread_data *td = arg;
> int ret;
> unsigned long nr_pages = td->size / 4096;
> unsigned int seed = 0x55443322;
> off_t offset;
> unsigned long nr_drops = 0;
>
> while (1) {
> offset = rand_r(&seed) % nr_pages;
> offset = offset * 4096;
> ret = posix_fadvise(td->fd, offset, 4096, POSIX_FADV_DONTNEED);
> if (ret < 0)
> err(1, "fadvise dontneed");
>
> /* every once and a while, drop everything */
> if (nr_drops > nr_pages / 2) {
> ret = posix_fadvise(td->fd, 0, td->size, POSIX_FADV_DONTNEED);
> if (ret < 0)
> err(1, "fadvise dontneed");
> fprintf(stderr, "+");
> nr_drops = 0;
> }
> nr_drops++;
> }
> return NULL;
> }
>
> #define READ_BUF (2 * 1024 * 1024)
> static void *read_pages(void *arg)
> {
> struct thread_data *td = arg;
> char buf[READ_BUF];
> ssize_t ret;
> loff_t offset;
>
> while (1) {
> offset = 0;
> while(offset < td->size) {
> ret = pread(td->fd, buf, READ_BUF, offset);
> if (ret < 0)
> err(1, "read");
> if (ret == 0)
> break;
> offset += ret;
> }
> }
> return NULL;
> }
>
> int main(int ac, char **av)
> {
> int fd;
> int ret;
> struct stat st;
> struct thread_data td;
> pthread_t drop_tid;
> pthread_t drop2_tid;
> pthread_t read_tid;
>
> if (ac != 2)
> err(1, "usage: reader filename\n");
>
> fd = open(av[1], O_RDONLY, 0600);
> if (fd < 0)
> err(1, "unable to open %s", av[1]);
>
> ret = fstat(fd, &st);
> if (ret < 0)
> err(1, "stat");
>
> td.fd = fd;
> td.size = st.st_size;
>
> ret = pthread_create(&drop_tid, NULL, drop_pages, &td);
> if (ret)
> err(1, "pthread_create");
> ret = pthread_create(&drop2_tid, NULL, drop_pages, &td);
> if (ret)
> err(1, "pthread_create");
> ret = pthread_create(&read_tid, NULL, read_pages, &td);
> if (ret)
> err(1, "pthread_create");
>
> pthread_join(drop_tid, NULL);
> pthread_join(drop2_tid, NULL);
> pthread_join(read_tid, NULL);
> }
> EOF
>
> $ sudo apt install build-essential
> $ gcc -Wall -o reader reader.c -lpthread
> $ chmod +x xfsfallout.bash
> $ ./xfsfallout.bash
>
> The kernel should hang in approximately 5 minutes or less.
>
> There is a test kernel available in the following ppas:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/lp2085495-test
>
> If you install this, running the testcase should not hang the kernel, even
> running the testcase for hours on end.
>
> [Where problems could occur]
>
> We are changing how items are inserted to large folio filemaps lists, ideally
> removing the race condition that corrupts the pointer to older stale entries,
> and solving the xfs hang.
>
> This shouldn't have any impact to item removal.
>
> Large folios are seeing increasing amounts of use in the kernel, with the
> largest user being xfs. If a regression were to occur, it would likely happen
> to users with xfs filesystems.
>
> There is no workaround. Users would have to revert to a working kernel if a
> regression occurs.
>
> The patches do come with a slight performance increase, but the main reason for
> the patchset is to fix the pointer corruption bug.
>
> [Other info]
>
> Very detailed upstream mailing list thread:
> https://lore.kernel.org/linux-mm/20240913-ortsausgang-baustart-1dae9a18254d@brauner/T/
>
> Kairui Song (3):
> lib/xarray: introduce a new helper xas_get_order
> mm/filemap: return early if failed to allocate memory for split
> mm/filemap: optimize filemap folio adding
>
> include/linux/xarray.h | 6 +++
> lib/test_xarray.c | 93 ++++++++++++++++++++++++++++++++++++++++++
> lib/xarray.c | 49 ++++++++++++++--------
> mm/filemap.c | 50 ++++++++++++++++++-----
> 4 files changed, 169 insertions(+), 29 deletions(-)
>
Applied to noble:linux/master-next. Thanks.
-Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_0xE8675DEECBEECEA3.asc
Type: application/pgp-keys
Size: 48643 bytes
Desc: OpenPGP public key
URL: <https://lists.ubuntu.com/archives/kernel-team/attachments/20241025/eba71591/attachment-0001.key>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.ubuntu.com/archives/kernel-team/attachments/20241025/eba71591/attachment-0001.sig>
More information about the kernel-team
mailing list