[SRU][Noble][PATCH 0/3] mm/folios: xfs hangs with hung task timeouts with corrupted folio pointer lists

Matthew Ruffell matthew.ruffell at canonical.com
Thu Oct 24 08:43:15 UTC 2024


BugLink: https://bugs.launchpad.net/bugs/2085495

[Impact]

A long running, and incredibly difficult to reproduce large folio issue leads to
hung task timeouts in the xfs subsystem with the following stack trace:

CPU: 0 PID: 226487 Comm: xfs_io Tainted: G             L     6.5.0-41-generic #41~22.04.2-Ubuntu
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:xas_descend+0x25/0xd0
Code: 90 90 90 90 90 55 48 89 e5 41 56 41 55 49 89 fd 41 54 49 89 f4 53 48 83 ec 08 0f b6 0e 48 8b 5f 08 80 f9 3f 0f 87 5d 2f 07 00 <48> d3 eb 83 e3 3f 89 d8 48 83 c0 04 49 8b 44 c4 08 4d 89 65 18 48
RSP: 0018:ffffaf9b44927a68 EFLAGS: 00000293
RAX: ffff8d61568f36d2 RBX: 00000000000005c0 RCX: 0000000000000006
RDX: 0000000000000002 RSI: ffff8d61568f36d0 RDI: ffffaf9b44927b10
RBP: ffffaf9b44927a90 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8d6159120938 R11: 0000000000000000 R12: ffff8d61568f36d0
R13: ffffaf9b44927b10 R14: ffffaf9b44927e30 R15: ffffaf9b44927e08
FS:  00007bcf4ce2c840(0000) GS:ffff8d61be400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007bcf4ca80df8 CR3: 0000000008524005 CR4: 0000000000370ef0
Call Trace:
 <IRQ>
 ? show_regs+0x6d/0x80
 ? watchdog_timer_fn+0x1d8/0x240
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x10f/0x2a0
 ? kvm_clock_get_cycles+0x18/0x40
 ? hrtimer_interrupt+0xf6/0x250
 ? __sysvec_apic_timer_interrupt+0x5f/0x140
 ? sysvec_apic_timer_interrupt+0x8d/0xd0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
 ? xas_descend+0x25/0xd0
 xas_load+0x4c/0x60
 __xas_next+0xa9/0x150
 filemap_get_read_batch+0x1a3/0x2e0
 filemap_get_pages+0xa9/0x3b0
 ? touch_atime+0x44/0x1c0
 filemap_read+0xe7/0x430
 generic_file_read_iter+0xbb/0x110
 ? down_read+0x12/0xc0
 xfs_file_buffered_read+0x57/0xe0 [xfs]
 xfs_file_read_iter+0xb6/0x1c0 [xfs]
 ? security_file_permission+0x5f/0x70
 vfs_read+0x20a/0x360
 __x64_sys_pread64+0xa6/0xd0
 x64_sys_call+0x1e01/0x20b0
 do_syscall_64+0x55/0x90
 ? do_syscall_64+0x61/0x90
 ? syscall_exit_to_user_mode+0x37/0x60
 ? do_syscall_64+0x61/0x90
 entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x7bcf4cd1278f
Code: 08 89 3c 24 48 89 4c 24 18 e8 7d e2 f7 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 bd e2 f7 ff 48 8b
RSP: 002b:00007fff220ed560 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007bcf4cd1278f
RDX: 0000000000010000 RSI: 0000623b5c5f2000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000005c0000 R11: 0000000000000293 R12: 00000000005c0000
R13: 000000001fa40000 R14: 00000000005c0000 R15: 0000000000000000
 </TASK>
watchdog: BUG: soft lockup - CPU#1 stuck for 417s! [xfs_io:226486]

The transaction never recovers, and the system must be force restarted. Doing
this can lose data not yet written to disk.

There is no workaround, other than to build your kernel disabling large folio
support for xfs.

[Fix]

The below patches fix the issue by more-or-less calling xas_reset() after 
xas_split_alloc(), which ensures the folio pointer list doesn't get corrupted
if a race condition occurs.

commit a4864671ca0bf51c8e78242951741df52c06766f
Author: Kairui Song <kasong at tencent.com>
Date:   Tue Apr 16 01:18:55 2024 +0800
Subject: lib/xarray: introduce a new helper xas_get_order
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4864671ca0bf51c8e78242951741df52c06766f

commit de60fd8ddeda2b41fbe11df11733838c5f684616
Author: Kairui Song <kasong at tencent.com>
Date:   Tue Apr 16 01:18:53 2024 +0800
Subject: mm/filemap: return early if failed to allocate memory for split
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de60fd8ddeda2b41fbe11df11733838c5f684616

commit 6758c1128ceb45d1a35298912b974eb4895b7dd9
Author: Kairui Song <kasong at tencent.com>
Date:   Tue Apr 16 01:18:56 2024 +0800
Subject: mm/filemap: optimize filemap folio adding
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6758c1128ceb45d1a35298912b974eb4895b7dd9

These all landed in 6.10-rc1. For Noble, we will use the versions from upstream
-stable 6.6.y directly from Greg KH. They contain minor backports from the
mainline variants, but cherry pick directly to 6.8 noble.

Only 6.1 or later is affected, so only noble needs the patches.

[Testcase]

You will need a disk attached to your VM, or a bare metal system with multiple
disks.

$ sudo lsblk
vdc     253:32   0   10G  0 disk
$ sudo mkfs.xfs /dev/vdc

$ cat >> xfsfallout.bash << EOF
#!/bin/bash
sudo mount /dev/vdc /mnt
for x in {0..8}; do sudo fallocate -l100m /mnt/file${x}; sudo ./reader /mnt/file${x} & done
EOF

$ cat >> reader.c << EOF
/*
 * gcc -Wall -o reader reader.c -lpthread
 */
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>
#include <errno.h>
#include <err.h>
#include <pthread.h>

struct thread_data {
    int fd;
    size_t size;
};

static void *drop_pages(void *arg)
{
    struct thread_data *td = arg;
    int ret;
    unsigned long nr_pages = td->size / 4096;
    unsigned int seed = 0x55443322;
    off_t offset;
    unsigned long nr_drops = 0;

    while (1) {
        offset = rand_r(&seed) % nr_pages;
        offset = offset * 4096;
        ret = posix_fadvise(td->fd,  offset, 4096, POSIX_FADV_DONTNEED);
        if (ret < 0)
            err(1, "fadvise dontneed");

        /* every once and a while, drop everything */
        if (nr_drops > nr_pages / 2) {
            ret = posix_fadvise(td->fd,  0, td->size, POSIX_FADV_DONTNEED);
            if (ret < 0)
                err(1, "fadvise dontneed");
            fprintf(stderr, "+");
            nr_drops = 0;
        }
        nr_drops++;
    }
    return NULL;
}

#define READ_BUF (2 * 1024 * 1024)
static void *read_pages(void *arg)
{
    struct thread_data *td = arg;
    char buf[READ_BUF];
    ssize_t ret;
    loff_t offset;

    while (1) {
        offset = 0;
        while(offset < td->size) {
            ret = pread(td->fd, buf, READ_BUF, offset);
            if (ret < 0)
                err(1, "read");
            if (ret == 0)
                break;
            offset += ret;
        }
    }
    return NULL;
}

int main(int ac, char **av)
{
    int fd;
    int ret;
    struct stat st;
    struct thread_data td;
    pthread_t drop_tid;
    pthread_t drop2_tid;
    pthread_t read_tid;

    if (ac != 2)
        err(1, "usage: reader filename\n");

    fd = open(av[1], O_RDONLY, 0600);
    if (fd < 0)
        err(1, "unable to open %s", av[1]);

    ret = fstat(fd, &st);
    if (ret < 0)
        err(1, "stat");

    td.fd = fd;
    td.size = st.st_size;

    ret = pthread_create(&drop_tid, NULL, drop_pages, &td);
    if (ret)
        err(1, "pthread_create");
    ret = pthread_create(&drop2_tid, NULL, drop_pages, &td);
    if (ret)
        err(1, "pthread_create");
    ret = pthread_create(&read_tid, NULL, read_pages, &td);
    if (ret)
        err(1, "pthread_create");

    pthread_join(drop_tid, NULL);
    pthread_join(drop2_tid, NULL);
    pthread_join(read_tid, NULL);
}
EOF

$ sudo apt install build-essential
$ gcc -Wall -o reader reader.c -lpthread
$ chmod +x xfsfallout.bash
$ ./xfsfallout.bash

The kernel should hang in approximately 5 minutes or less.

There is a test kernel available in the following ppas:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2085495-test

If you install this, running the testcase should not hang the kernel, even
running the testcase for hours on end.

[Where problems could occur]

We are changing how items are inserted to large folio filemaps lists, ideally
removing the race condition that corrupts the pointer to older stale entries,
and solving the xfs hang. 

This shouldn't have any impact to item removal.

Large folios are seeing increasing amounts of use in the kernel, with the
largest user being xfs. If a regression were to occur, it would likely happen
to users with xfs filesystems. 

There is no workaround. Users would have to revert to a working kernel if a
regression occurs.

The patches do come with a slight performance increase, but the main reason for
the patchset is to fix the pointer corruption bug.

[Other info]

Very detailed upstream mailing list thread:
https://lore.kernel.org/linux-mm/20240913-ortsausgang-baustart-1dae9a18254d@brauner/T/

Kairui Song (3):
  lib/xarray: introduce a new helper xas_get_order
  mm/filemap: return early if failed to allocate memory for split
  mm/filemap: optimize filemap folio adding

 include/linux/xarray.h |  6 +++
 lib/test_xarray.c      | 93 ++++++++++++++++++++++++++++++++++++++++++
 lib/xarray.c           | 49 ++++++++++++++--------
 mm/filemap.c           | 50 ++++++++++++++++++-----
 4 files changed, 169 insertions(+), 29 deletions(-)

-- 
2.45.2




More information about the kernel-team mailing list