[SRU][N/O/P][PATCH 0/1] MGLRU: page allocation failure on NUMA-enabled systems
Koichiro Den
koichiro.den at canonical.com
Wed Feb 12 15:25:56 UTC 2025
On Wed, Feb 12, 2025 at 11:33:00AM GMT, Heitor Alves de Siqueira wrote:
> On Wed, Feb 12, 2025 at 9:36 AM Koichiro Den <koichiro.den at canonical.com>
> wrote:
>
> > On Wed, Feb 12, 2025 at 09:14:27AM GMT, Heitor Alves de Siqueira wrote:
> > > Hi Koichiro,
> > >
> > > thanks for looking into this! Yes, I've used the attached scripts to
> > > reproduce the issue successfully, although only on aarch64 systems
> > > (specifically, I've used Grace-Grace for my tests).
> > > I've not been able to reproduce this reliably on x86 or other
> > > architectures, and using 64k page sizes also makes this much faster/easier
> > > to reproduce.
> >
> > Thanks for the reply. Just to confirm: when you reproduced it, did you
> > verify that there was a large number of dirty folios in the LRU list of
> > the coldest generation for FILE (not ANON)?
>
>
> Here's a stack trace from the latest reproducer run I did earlier this
> morning, using kernel 6.8.0-53-generic-64k from Noble:
>
> [ 124.550628] alloc_and_crash: page allocation failure: order:0,
> mode:0x141cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_WRITE),
> nodemask=0,cpuset=/,mems_allowed=0-1
> [ 124.550648] CPU: 135 PID: 3406 Comm: alloc_and_crash Not tainted
> 6.8.0-53-generic-64k #55-Ubuntu
> [ 124.550651] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c
> 12/28/2023
> [ 124.550653] Call trace:
> [ 124.550656] dump_backtrace+0xa4/0x150
> [ 124.550665] show_stack+0x24/0x50
> [ 124.550667] dump_stack_lvl+0xc8/0x138
> [ 124.550671] dump_stack+0x1c/0x38
> [ 124.550672] warn_alloc+0x16c/0x1f0
> [ 124.550677] __alloc_pages_slowpath.constprop.0+0x8e4/0x9f0
> [ 124.550679] __alloc_pages+0x2f0/0x3a8
> [ 124.550680] alloc_pages_mpol+0x94/0x290
> [ 124.550685] alloc_pages+0x6c/0x118
> [ 124.550687] folio_alloc+0x24/0x98
> [ 124.550689] filemap_alloc_folio+0x168/0x188
> [ 124.550692] __filemap_get_folio+0x1bc/0x3f8
> [ 124.550694] ext4_da_write_begin+0x144/0x300
> [ 124.550697] generic_perform_write+0xc4/0x228
> [ 124.550699] ext4_buffered_write_iter+0x78/0x180
> [ 124.550701] ext4_file_write_iter+0x44/0xf0
> [ 124.550702] __kernel_write_iter+0x10c/0x2c0
> [ 124.550704] dump_user_range+0xe0/0x240
> [ 124.550707] elf_core_dump+0x4cc/0x538
> [ 124.550709] do_coredump+0x574/0x988
> [ 124.550711] get_signal+0x7dc/0x8f0
> [ 124.550713] do_signal+0x138/0x1f8
> [ 124.550715] do_notify_resume+0x114/0x298
> [ 124.550716] el0_da+0xdc/0x178
> [ 124.550719] el0t_64_sync_handler+0xdc/0x158
> [ 124.550721] el0t_64_sync+0x1b0/0x1b8
> [ 124.550723] Mem-Info:
> [ 124.550728] active_anon:3921 inactive_anon:3473262 isolated_anon:0
> active_file:933 inactive_file:252531 isolated_file:0
> unevictable:609 dirty:241262 writeback:0
> slab_reclaimable:9234 slab_unreclaimable:35922
> mapped:3472425 shmem:3474488 pagetables:624
> sec_pagetables:0 bounce:0
> kernel_misc_reclaimable:0
> free:4031494 free_pcp:0 free_cma:48
> [ 124.550733] Node 0 active_anon:206656kB inactive_anon:222288768kB
> active_file:1728kB inactive_file:15437504kB unevictable:9024kB
> isolated(anon):0kB isolated(file):0kB mapped:222210880kB dirty:15437568kB
> writeback:0kB shmem:222337216kB shmem_thp:0kB shmem_pmdmapped:0kB
> anon_thp:0kB writeback_tmp:0kB kernel_stack:51584kB
> shadow_call_stack:66368kB pagetables:38016kB
> sec_pagetables:0kB all_unreclaimable? yes
> [ 124.550738] Node 0 DMA free:1041984kB boost:0kB min:69888kB low:87360kB
> high:104832kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:393472kB unevictable:0kB
> writepending:394112kB present:2097152kB managed:2029632kB mlocked:0kB
> bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:3072kB
> [ 124.550742] lowmem_reserve[]: 0 0 15189 15189 15189
> [ 124.550747] Node 0 Normal free:8574848kB boost:0kB min:8575808kB
> low:10719744kB high:12863680kB reserved_highatomic:0KB active_anon:206656kB
> inactive_anon:222288768kB active_file:1728kB inactive_file:15044032kB
> unevictable:9024kB writepending:15043456kB present:249244544kB
> managed:248932800kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB
> [ 124.550750] lowmem_reserve[]: 0 0 0 0 0
> [ 124.550754] Node 0 DMA: 5*64kB (ME) 4*128kB (ME) 1*256kB (U) 7*512kB
> (UE) 5*1024kB (UMEC) 2*2048kB (UC) 3*4096kB (UME) 2*8192kB (ME) 3*16384kB
> (UME) 3*32768kB (UME) 1*65536kB (U) 2*131072kB (UE) 2*262144kB (UE)
> 0*524288kB = 1041984kB
> [ 124.550769] Node 0 Normal: 726*64kB (UME) 392*128kB (UME) 246*256kB (UE)
> 138*512kB (UME) 65*1024kB (UE) 48*2048kB (UME) 19*4096kB (UE) 7*8192kB
> (UME) 5*16384kB (U) 3*32768kB (UM) 2*65536kB (ME) 1*131072kB (E) 1*262144kB
> (M) 14*524288kB (M) = 8574848kB
> [ 124.550786] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=16777216kB
> [ 124.550788] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=524288kB
> [ 124.550789] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=2048kB
> [ 124.550790] 3729522 total pagecache pages
> [ 124.550792] 1406 pages in swap cache
> [ 124.550793] Free swap = 0kB
> [ 124.550794] Total swap = 8388544kB
> [ 124.550795] 7858556 pages RAM
> [ 124.550796] 0 pages HighMem/MovableOnly
> [ 124.550796] 12342 pages reserved
> [ 124.550797] 8192 pages cma reserved
> [ 124.550798] 0 pages hwpoisoned
>
> And here's /proc/meminfo from just before the crash:
> MemTotal: 502157696 kB
> MemFree: 258273600 kB
> MemAvailable: 236229312 kB
> Buffers: 29632 kB
> Cached: 237187456 kB
> SwapCached: 1374848 kB
> Active: 9878912 kB
> Inactive: 228723776 kB
> Active(anon): 1307520 kB
> Inactive(anon): 227959296 kB
> Active(file): 8571392 kB
> Inactive(file): 764480 kB
> Unevictable: 38976 kB
> Mlocked: 29952 kB
> SwapTotal: 8388544 kB
> SwapFree: 5436224 kB
> Zswap: 0 kB
> Zswapped: 0 kB
> Dirty: 8519168 kB
> Writeback: 1250368 kB
> AnonPages: 79424 kB
> Mapped: 227857920 kB
> Shmem: 227861632 kB
> KReclaimable: 423680 kB
> Slab: 2767232 kB
> SReclaimable: 423680 kB
> SUnreclaim: 2343552 kB
> KernelStack: 93440 kB
> ShadowCallStack: 121088 kB
> PageTables: 40640 kB
> SecPageTables: 0 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 259467392 kB
> Committed_AS: 231067456 kB
> VmallocTotal: 137168158720 kB
> VmallocUsed: 567680 kB
> VmallocChunk: 0 kB
> Percpu: 156672 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> ShmemHugePages: 0 kB
> ShmemPmdMapped: 0 kB
> FileHugePages: 0 kB
> FilePmdMapped: 0 kB
> CmaTotal: 524288 kB
> CmaFree: 3072 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 524288 kB
> Hugetlb: 0 kB
>
> So while the number of ANON pages is much higher (due to how we set up the
> reproducer), we can still trigger the page allocation failures with enough
> pressure on the LRU lists.
Thanks for the very useful information. Now it makes sense: on a machine
with this much memory, a relatively small amount of file-backed dirty
folios, well below the normal dirty background ratio, can still fill up
the number of folios to scan, and writeback does not proceed for them at
all.
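
For reference, the shape of the check I understand the fix adds is roughly
the following. This is only a sketch based on my reading of 1bc542c6a0d1
("mm/vmscan: wake up flushers conditionally to avoid cgroup OOM"), not the
patch itself, so the exact placement and the names used below (scanned,
stat.nr_unqueued_dirty) should be treated as assumptions:

/*
 * Illustrative sketch only -- not the literal upstream diff. The idea:
 * in the MGLRU eviction path, after shrink_folio_list() has processed
 * the folios taken from the coldest generation, check whether they were
 * all dirty but not yet queued for I/O, and if so wake the flusher
 * threads instead of letting reclaim spin until the allocation fails.
 */
	reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);

	/* every scanned folio was dirty and none were queued for writeback */
	if (stat.nr_unqueued_dirty && stat.nr_unqueued_dirty == scanned)
		wakeup_flusher_threads(WB_REASON_VMSCAN);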
>
> > Could you answer the rest of my questions in the previous email?
> >
>
> Sure!
> I did use the scripts from the LP bug to reproduce it successfully, with
> the caveats I mentioned previously (only on aarch64, and easier with 64k
> pages).
> I landed on the fix commit mentioned above by bisecting the upstream kernel
> (Linus' tree), and confirmed the issue does not happen when cherry-picking
> commit 1bc542c6a0d1 into Ubuntu kernels. I've validated this for Noble,
> Oracular and Plucky.
Thanks for the additional info. Let me add my Acked-by shortly.
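
For anyone else who wants to try this on other hardware, the reproducer
essentially boils down to the sketch below. This is a hypothetical minimal
version written from the description in the test plan; the real
alloc_and_crash.c and lp2097214-repro.sh are attached to the LP bug, and
the size argument and the numactl invocation here are only illustrative:

/* build:  gcc -O2 -o alloc_and_crash alloc_and_crash.c
 * run:    ulimit -c unlimited; numactl --membind=0 ./alloc_and_crash 64
 * Binding to one node and letting the kernel write out the huge core dump
 * is what drives the dirty-file pressure seen in the stack trace above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t gib = argc > 1 ? strtoull(argv[1], NULL, 0) : 64;
	size_t len = gib << 30;

	/* grab a huge anonymous mapping... */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ...dirty every page so it all has to be dumped... */
	memset(p, 0xa5, len);

	/* ...then SEGFAULT on purpose so the kernel writes the core dump */
	*(volatile int *)0 = 0;
	return 0;
}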
>
> Let me know if you need any more info on this!
>
> > [...]
> > > Also, did you confirm that the issue was resolved after applying the patch
> > > for Noble/Oracular/Plucky? It seems to me that it's just stressing lru
> > > list for ANON, not FILE.
> >
> > >
> > > On Wed, Feb 12, 2025 at 1:37 AM Koichiro Den <koichiro.den at canonical.com>
> > > wrote:
> > >
> > > > On Sun, Feb 02, 2025 at 12:21:50PM GMT, Heitor Alves de Siqueira wrote:
> > > > > BugLink: https://bugs.launchpad.net/bugs/2097214
> > > > >
> > > > > [Impact]
> > > > > * On MGLRU-enabled systems, high memory pressure on NUMA nodes will cause page
> > > > >   allocation failures
> > > > > * This happens due to page reclaim not waking up flusher threads
> > > > > * OOM can be triggered even if the system has enough available memory
> > > > >
> > > > > [Test Plan]
> > > > > * For the bug to properly trigger, we should uninstall apport and use the
> > > > >   attached alloc_and_crash.c reproducer
> > > > > * alloc_and_crash will mmap a huge range of memory, memset it and
> > > > >   forcibly SEGFAULT
> > > > > * The attached bash script will membind alloc_and_crash to NUMA node 0, so we
> > > > >   can see the allocation failures in dmesg
> > > > >   $ sudo apt remove --purge apport
> > > > >   $ sudo dmesg -c; ./lp2097214-repro.sh; sleep 2; sudo dmesg
> > > >
> > > > I looked over the attached files (alloc_and_crash.c and
> > > > lp2097214-repro.sh).
> > > >
> > > > Question:
> > > > Did you use them to reproduce the issue that you want to resolve here?
> > > > Also, did you confirm that the issue was resolved after applying the patch
> > > > for Noble/Oracular/Plucky? It seems to me that it's just stressing lru
> > > > list for ANON, not FILE.
> > > >
> > > > >
> > > > > [Fix]
> > > > > * The upstream patch wakes up flusher threads if there are too many dirty
> > > > >   entries in the coldest LRU generation
> > > > > * This happens when trying to shrink lruvecs, so reclaim only gets woken up
> > > > >   during high memory pressure
> > > > > * Fix was introduced by commit:
> > > > >   1bc542c6a0d1 mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
> > > > >
> > > > > [Regression Potential]
> > > > > * This commit fixes the memory reclaim path, so regressions would likely show
> > > > >   up during increased system memory pressure
> > > > > * According to the upstream patch, increased SSD/disk wear is possible due
> > > > >   to waking up flusher threads, although this has not been noted in testing
> > > > >
> > > > > Zeng Jingxiang (1):
> > > > > mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
> > > > >
> > > > > mm/vmscan.c | 25 ++++++++++++++++++++++---
> > > > > 1 file changed, 22 insertions(+), 3 deletions(-)
> > > > >
> > > > > --
> > > > > 2.48.1
> > > > >
> > > > >
> > > > > --
> > > > > kernel-team mailing list
> > > > > kernel-team at lists.ubuntu.com
> > > > > https://lists.ubuntu.com/mailman/listinfo/kernel-team
> > > >
> >