[SRU][N/O/P][PATCH 0/1] MGLRU: page allocation failure on NUMA-enabled systems
Koichiro Den
koichiro.den at canonical.com
Wed Feb 12 15:25:56 UTC 2025
On Wed, Feb 12, 2025 at 11:33:00AM GMT, Heitor Alves de Siqueira wrote:
> On Wed, Feb 12, 2025 at 9:36 AM Koichiro Den <koichiro.den at canonical.com>
> wrote:
>
> > On Wed, Feb 12, 2025 at 09:14:27AM GMT, Heitor Alves de Siqueira wrote:
> > > Hi Koichiro,
> > >
> > > thanks for looking into this! Yes, I've used the attached scripts to
> > > reproduce the issue successfully, although only on aarch64 systems
> > > (specifically, I've used Grace-Grace for my tests).
> > > I've not been able to reproduce this reliably on x86 or other
> > > architectures, and using 64k page sizes also makes this much faster/easier
> > > to reproduce.
> >
> > Thanks for the reply. Just to confirm: when you reproduced it, did you
> > verify that there was a large number of dirty folios in the LRU list of
> > the coldest generation for FILE (not ANON)?
>
>
> Here's a stack trace from the latest reproducer run I did earlier this
> morning, using kernel 6.8.0-53-generic-64k from Noble:
>
> [ 124.550628] alloc_and_crash: page allocation failure: order:0,
> mode:0x141cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_WRITE),
> nodemask=0,cpuset=/,mems_allowed=0-1
> [ 124.550648] CPU: 135 PID: 3406 Comm: alloc_and_crash Not tainted
> 6.8.0-53-generic-64k #55-Ubuntu
> [ 124.550651] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c
> 12/28/2023
> [ 124.550653] Call trace:
> [ 124.550656] dump_backtrace+0xa4/0x150
> [ 124.550665] show_stack+0x24/0x50
> [ 124.550667] dump_stack_lvl+0xc8/0x138
> [ 124.550671] dump_stack+0x1c/0x38
> [ 124.550672] warn_alloc+0x16c/0x1f0
> [ 124.550677] __alloc_pages_slowpath.constprop.0+0x8e4/0x9f0
> [ 124.550679] __alloc_pages+0x2f0/0x3a8
> [ 124.550680] alloc_pages_mpol+0x94/0x290
> [ 124.550685] alloc_pages+0x6c/0x118
> [ 124.550687] folio_alloc+0x24/0x98
> [ 124.550689] filemap_alloc_folio+0x168/0x188
> [ 124.550692] __filemap_get_folio+0x1bc/0x3f8
> [ 124.550694] ext4_da_write_begin+0x144/0x300
> [ 124.550697] generic_perform_write+0xc4/0x228
> [ 124.550699] ext4_buffered_write_iter+0x78/0x180
> [ 124.550701] ext4_file_write_iter+0x44/0xf0
> [ 124.550702] __kernel_write_iter+0x10c/0x2c0
> [ 124.550704] dump_user_range+0xe0/0x240
> [ 124.550707] elf_core_dump+0x4cc/0x538
> [ 124.550709] do_coredump+0x574/0x988
> [ 124.550711] get_signal+0x7dc/0x8f0
> [ 124.550713] do_signal+0x138/0x1f8
> [ 124.550715] do_notify_resume+0x114/0x298
> [ 124.550716] el0_da+0xdc/0x178
> [ 124.550719] el0t_64_sync_handler+0xdc/0x158
> [ 124.550721] el0t_64_sync+0x1b0/0x1b8
> [ 124.550723] Mem-Info:
> [ 124.550728] active_anon:3921 inactive_anon:3473262 isolated_anon:0
> active_file:933 inactive_file:252531 isolated_file:0
> unevictable:609 dirty:241262 writeback:0
> slab_reclaimable:9234 slab_unreclaimable:35922
> mapped:3472425 shmem:3474488 pagetables:624
> sec_pagetables:0 bounce:0
> kernel_misc_reclaimable:0
> free:4031494 free_pcp:0 free_cma:48
> [ 124.550733] Node 0 active_anon:206656kB inactive_anon:222288768kB
> active_file:1728kB inactive_file:15437504kB unevictable:9024kB
> isolated(anon):0kB isolated(file):0kB mapped:222210880kB dirty:15437568kB
> writeback:0kB shmem:222337216kB shmem_thp:0kB shmem_pmdmapped:0kB
> anon_thp:0kB writeback_tmp:0kB kernel_stack:51584kB
> shadow_call_stack:66368kB pagetables:38016kB
> sec_pagetables:0kB all_unreclaimable? yes
> [ 124.550738] Node 0 DMA free:1041984kB boost:0kB min:69888kB low:87360kB
> high:104832kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:393472kB unevictable:0kB
> writepending:394112kB present:2097152kB managed:2029632kB mlocked:0kB
> bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:3072kB
> [ 124.550742] lowmem_reserve[]: 0 0 15189 15189 15189
> [ 124.550747] Node 0 Normal free:8574848kB boost:0kB min:8575808kB
> low:10719744kB high:12863680kB reserved_highatomic:0KB active_anon:206656kB
> inactive_anon:222288768kB active_file:1728kB inactive_file:15044032kB
> unevictable:9024kB writepending:15043456kB present:249244544kB
> managed:248932800kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB
> [ 124.550750] lowmem_reserve[]: 0 0 0 0 0
> [ 124.550754] Node 0 DMA: 5*64kB (ME) 4*128kB (ME) 1*256kB (U) 7*512kB
> (UE) 5*1024kB (UMEC) 2*2048kB (UC) 3*4096kB (UME) 2*8192kB (ME) 3*16384kB
> (UME) 3*32768kB (UME) 1*65536kB (U) 2*131072kB (UE) 2*262144kB (UE)
> 0*524288kB = 1041984kB
> [ 124.550769] Node 0 Normal: 726*64kB (UME) 392*128kB (UME) 246*256kB (UE)
> 138*512kB (UME) 65*1024kB (UE) 48*2048kB (UME) 19*4096kB (UE) 7*8192kB
> (UME) 5*16384kB (U) 3*32768kB (UM) 2*65536kB (ME) 1*131072kB (E) 1*262144kB
> (M) 14*524288kB (M) = 8574848kB
> [ 124.550786] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=16777216kB
> [ 124.550788] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=524288kB
> [ 124.550789] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=2048kB
> [ 124.550790] 3729522 total pagecache pages
> [ 124.550792] 1406 pages in swap cache
> [ 124.550793] Free swap = 0kB
> [ 124.550794] Total swap = 8388544kB
> [ 124.550795] 7858556 pages RAM
> [ 124.550796] 0 pages HighMem/MovableOnly
> [ 124.550796] 12342 pages reserved
> [ 124.550797] 8192 pages cma reserved
> [ 124.550798] 0 pages hwpoisoned
>
> And here's /proc/meminfo from just before the crash:
> MemTotal: 502157696 kB
> MemFree: 258273600 kB
> MemAvailable: 236229312 kB
> Buffers: 29632 kB
> Cached: 237187456 kB
> SwapCached: 1374848 kB
> Active: 9878912 kB
> Inactive: 228723776 kB
> Active(anon): 1307520 kB
> Inactive(anon): 227959296 kB
> Active(file): 8571392 kB
> Inactive(file): 764480 kB
> Unevictable: 38976 kB
> Mlocked: 29952 kB
> SwapTotal: 8388544 kB
> SwapFree: 5436224 kB
> Zswap: 0 kB
> Zswapped: 0 kB
> Dirty: 8519168 kB
> Writeback: 1250368 kB
> AnonPages: 79424 kB
> Mapped: 227857920 kB
> Shmem: 227861632 kB
> KReclaimable: 423680 kB
> Slab: 2767232 kB
> SReclaimable: 423680 kB
> SUnreclaim: 2343552 kB
> KernelStack: 93440 kB
> ShadowCallStack: 121088 kB
> PageTables: 40640 kB
> SecPageTables: 0 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 259467392 kB
> Committed_AS: 231067456 kB
> VmallocTotal: 137168158720 kB
> VmallocUsed: 567680 kB
> VmallocChunk: 0 kB
> Percpu: 156672 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> ShmemHugePages: 0 kB
> ShmemPmdMapped: 0 kB
> FileHugePages: 0 kB
> FilePmdMapped: 0 kB
> CmaTotal: 524288 kB
> CmaFree: 3072 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 524288 kB
> Hugetlb: 0 kB
>
> So while the number of ANON pages is much higher (due to how we set up the
> reproducer), we can still trigger the page allocation failures with enough
> pressure on the LRU lists.
Thanks for the very useful information. Now it makes sense: on a machine
with this much memory, a relatively small amount of file-backed dirty
folios, well below the normal dirty background ratio, can still fill up
the number of folios to scan, and writeback does not proceed for them at
all.
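
For reference, the shape of the check I understand the fix adds is roughly
the following. This is only a sketch based on my reading of 1bc542c6a0d1
("mm/vmscan: wake up flushers conditionally to avoid cgroup OOM"), not the
patch itself, so the exact placement and the names used below (scanned,
stat.nr_unqueued_dirty) should be treated as assumptions:

/*
 * Illustrative sketch only -- not the literal upstream diff. The idea:
 * in the MGLRU eviction path, after shrink_folio_list() has processed
 * the folios taken from the coldest generation, check whether they were
 * all dirty but not yet queued for I/O, and if so wake the flusher
 * threads instead of letting reclaim spin until the allocation fails.
 */
	reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);

	/* every scanned folio was dirty and none were queued for writeback */
	if (stat.nr_unqueued_dirty && stat.nr_unqueued_dirty == scanned)
		wakeup_flusher_threads(WB_REASON_VMSCAN);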
>
> > Could you answer the rest of my questions in the previous email?
> >
>
> Sure!
> I did use the scripts from the LP bug to reproduce it successfully, with
> the caveats I mentioned previously (only on aarch64, and easier with 64k
> pages).
> I landed on the fix commit mentioned above by bisecting the upstream kernel
> (Linus' tree), and confirmed the issue does not happen when cherry-picking
> commit 1bc542c6a0d1 into Ubuntu kernels. I've validated this for Noble,
> Oracular and Plucky.
Thanks for the additional info. Let me add my Acked-by shortly.
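
For anyone else who wants to try this on other hardware, the reproducer
essentially boils down to the sketch below. This is a hypothetical minimal
version written from the description in the test plan; the real
alloc_and_crash.c and lp2097214-repro.sh are attached to the LP bug, and
the size argument and the numactl invocation here are only illustrative:

/* build:  gcc -O2 -o alloc_and_crash alloc_and_crash.c
 * run:    ulimit -c unlimited; numactl --membind=0 ./alloc_and_crash 64
 * Binding to one node and letting the kernel write out the huge core dump
 * is what drives the dirty-file pressure seen in the stack trace above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t gib = argc > 1 ? strtoull(argv[1], NULL, 0) : 64;
	size_t len = gib << 30;

	/* grab a huge anonymous mapping... */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ...dirty every page so it all has to be dumped... */
	memset(p, 0xa5, len);

	/* ...then SEGFAULT on purpose so the kernel writes the core dump */
	*(volatile int *)0 = 0;
	return 0;
}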
>
> Let me know if you need any more info on this!
>
> > [...]
> > > Also, did you confirm that the issue was resolved after applying the patch
> > > for Noble/Oracular/Plucky? It seems to me that it's just stressing lru
> > > list for ANON, not FILE.
> >
> > >
> > > On Wed, Feb 12, 2025 at 1:37 AM Koichiro Den <koichiro.den at canonical.com>
> > > wrote:
> > >
> > > > On Sun, Feb 02, 2025 at 12:21:50PM GMT, Heitor Alves de Siqueira wrote:
> > > > > BugLink: https://bugs.launchpad.net/bugs/2097214
> > > > >
> > > > > [Impact]
> > > > > * On MGLRU-enabled systems, high memory pressure on NUMA nodes will cause page
> > > > >   allocation failures
> > > > > * This happens due to page reclaim not waking up flusher threads
> > > > > * OOM can be triggered even if the system has enough available memory
> > > > >
> > > > > [Test Plan]
> > > > > * For the bug to properly trigger, we should uninstall apport and use the
> > > > >   attached alloc_and_crash.c reproducer
> > > > > * alloc_and_crash will mmap a huge range of memory, memset it and
> > > > >   forcibly SEGFAULT
> > > > > * The attached bash script will membind alloc_and_crash to NUMA node 0, so we
> > > > >   can see the allocation failures in dmesg
> > > > >   $ sudo apt remove --purge apport
> > > > >   $ sudo dmesg -c; ./lp2097214-repro.sh; sleep 2; sudo dmesg
> > > >
> > > > I looked over the attached files (alloc_and_crash.c and
> > > > lp2097214-repro.sh).
> > > >
> > > > Question:
> > > > Did you use them to reproduce the issue that you want to resolve here?
> > > > Also, did you confirm that the issue was resolved after applying the patch
> > > > for Noble/Oracular/Plucky? It seems to me that it's just stressing lru
> > > > list for ANON, not FILE.
> > > >
> > > > >
> > > > > [Fix]
> > > > > * The upstream patch wakes up flusher threads if there are too many dirty
> > > > >   entries in the coldest LRU generation
> > > > > * This happens when trying to shrink lruvecs, so reclaim only gets woken up
> > > > >   during high memory pressure
> > > > > * Fix was introduced by commit:
> > > > >   1bc542c6a0d1 mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
> > > > >
> > > > > [Regression Potential]
> > > > > * This commit fixes the memory reclaim path, so regressions would likely show
> > > > >   up during increased system memory pressure
> > > > > * According to the upstream patch, increased SSD/disk wear is possible due
> > > > >   to waking up flusher threads, although this has not been noted in testing
> > > > >
> > > > > Zeng Jingxiang (1):
> > > > > mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
> > > > >
> > > > > mm/vmscan.c | 25 ++++++++++++++++++++++---
> > > > > 1 file changed, 22 insertions(+), 3 deletions(-)
> > > > >
> > > > > --
> > > > > 2.48.1
> > > > >
> > > > >
> > > > > --
> > > > > kernel-team mailing list
> > > > > kernel-team at lists.ubuntu.com
> > > > > https://lists.ubuntu.com/mailman/listinfo/kernel-team
> > > >
> >