[SRU][N/O/P][PATCH 0/1] MGLRU: page allocation failure on NUMA-enabled systems
Heitor Alves de Siqueira
halves at canonical.com
Wed Feb 12 14:33:00 UTC 2025
On Wed, Feb 12, 2025 at 9:36 AM Koichiro Den <koichiro.den at canonical.com>
wrote:
> On Wed, Feb 12, 2025 at 09:14:27AM GMT, Heitor Alves de Siqueira wrote:
> > Hi Koichiro,
> >
> > thanks for looking into this! Yes, I've used the attached scripts to
> > reproduce the issue successfully, although only on aarch64 systems
> > (specifically, I've used Grace-Grace for my tests).
> > I've not been able to reproduce this reliably on x86 or other
> > architectures, and using 64k page sizes also makes this much
> > faster/easier to reproduce.
>
> Thanks for the reply. Just let me confirm: when you verified that you
> reproduced it, you confirmed that there was a large number of dirty folios
> in the LRU list for the coldest generation for FILE (not ANON), right?
Here's a stack trace from the latest reproducer run I did earlier this
morning, using kernel 6.8.0-53-generic-64k from Noble:
[ 124.550628] alloc_and_crash: page allocation failure: order:0,
mode:0x141cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_WRITE),
nodemask=0,cpuset=/,mems_allowed=0-1
[ 124.550648] CPU: 135 PID: 3406 Comm: alloc_and_crash Not tainted
6.8.0-53-generic-64k #55-Ubuntu
[ 124.550651] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c
12/28/2023
[ 124.550653] Call trace:
[ 124.550656] dump_backtrace+0xa4/0x150
[ 124.550665] show_stack+0x24/0x50
[ 124.550667] dump_stack_lvl+0xc8/0x138
[ 124.550671] dump_stack+0x1c/0x38
[ 124.550672] warn_alloc+0x16c/0x1f0
[ 124.550677] __alloc_pages_slowpath.constprop.0+0x8e4/0x9f0
[ 124.550679] __alloc_pages+0x2f0/0x3a8
[ 124.550680] alloc_pages_mpol+0x94/0x290
[ 124.550685] alloc_pages+0x6c/0x118
[ 124.550687] folio_alloc+0x24/0x98
[ 124.550689] filemap_alloc_folio+0x168/0x188
[ 124.550692] __filemap_get_folio+0x1bc/0x3f8
[ 124.550694] ext4_da_write_begin+0x144/0x300
[ 124.550697] generic_perform_write+0xc4/0x228
[ 124.550699] ext4_buffered_write_iter+0x78/0x180
[ 124.550701] ext4_file_write_iter+0x44/0xf0
[ 124.550702] __kernel_write_iter+0x10c/0x2c0
[ 124.550704] dump_user_range+0xe0/0x240
[ 124.550707] elf_core_dump+0x4cc/0x538
[ 124.550709] do_coredump+0x574/0x988
[ 124.550711] get_signal+0x7dc/0x8f0
[ 124.550713] do_signal+0x138/0x1f8
[ 124.550715] do_notify_resume+0x114/0x298
[ 124.550716] el0_da+0xdc/0x178
[ 124.550719] el0t_64_sync_handler+0xdc/0x158
[ 124.550721] el0t_64_sync+0x1b0/0x1b8
[ 124.550723] Mem-Info:
[ 124.550728] active_anon:3921 inactive_anon:3473262 isolated_anon:0
active_file:933 inactive_file:252531 isolated_file:0
unevictable:609 dirty:241262 writeback:0
slab_reclaimable:9234 slab_unreclaimable:35922
mapped:3472425 shmem:3474488 pagetables:624
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:4031494 free_pcp:0 free_cma:48
[ 124.550733] Node 0 active_anon:206656kB inactive_anon:222288768kB
active_file:1728kB inactive_file:15437504kB unevictable:9024kB
isolated(anon):0kB isolated(file):0kB mapped:222210880kB dirty:15437568kB
writeback:0kB shmem:222337216kB shmem_thp:0kB shmem_pmdmapped:0kB
anon_thp:0kB writeback_tmp:0kB
kernel_stack:51584kB shadow_call_stack:66368kB pagetables:38016kB
sec_pagetables:0kB all_unreclaimable? yes
[ 124.550738] Node 0 DMA free:1041984kB boost:0kB min:69888kB low:87360kB
high:104832kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:393472kB unevictable:0kB
writepending:394112kB present:2097152kB managed:2029632kB mlocked:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:3072kB
[ 124.550742] lowmem_reserve[]: 0 0 15189 15189 15189
[ 124.550747] Node 0 Normal free:8574848kB boost:0kB min:8575808kB
low:10719744kB high:12863680kB reserved_highatomic:0KB active_anon:206656kB
inactive_anon:222288768kB active_file:1728kB inactive_file:15044032kB
unevictable:9024kB writepending:15043456kB present:249244544kB
managed:248932800kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[ 124.550750] lowmem_reserve[]: 0 0 0 0 0
[ 124.550754] Node 0 DMA: 5*64kB (ME) 4*128kB (ME) 1*256kB (U) 7*512kB
(UE) 5*1024kB (UMEC) 2*2048kB (UC) 3*4096kB (UME) 2*8192kB (ME) 3*16384kB
(UME) 3*32768kB (UME) 1*65536kB (U) 2*131072kB (UE) 2*262144kB (UE)
0*524288kB = 1041984kB
[ 124.550769] Node 0 Normal: 726*64kB (UME) 392*128kB (UME) 246*256kB (UE)
138*512kB (UME) 65*1024kB (UE) 48*2048kB (UME) 19*4096kB (UE) 7*8192kB
(UME) 5*16384kB (U) 3*32768kB (UM) 2*65536kB (ME) 1*131072kB (E) 1*262144kB
(M) 14*524288kB (M) = 8574848kB
[ 124.550786] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=16777216kB
[ 124.550788] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=524288kB
[ 124.550789] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
[ 124.550790] 3729522 total pagecache pages
[ 124.550792] 1406 pages in swap cache
[ 124.550793] Free swap = 0kB
[ 124.550794] Total swap = 8388544kB
[ 124.550795] 7858556 pages RAM
[ 124.550796] 0 pages HighMem/MovableOnly
[ 124.550796] 12342 pages reserved
[ 124.550797] 8192 pages cma reserved
[ 124.550798] 0 pages hwpoisoned
And here's /proc/meminfo from just before the crash:
MemTotal: 502157696 kB
MemFree: 258273600 kB
MemAvailable: 236229312 kB
Buffers: 29632 kB
Cached: 237187456 kB
SwapCached: 1374848 kB
Active: 9878912 kB
Inactive: 228723776 kB
Active(anon): 1307520 kB
Inactive(anon): 227959296 kB
Active(file): 8571392 kB
Inactive(file): 764480 kB
Unevictable: 38976 kB
Mlocked: 29952 kB
SwapTotal: 8388544 kB
SwapFree: 5436224 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 8519168 kB
Writeback: 1250368 kB
AnonPages: 79424 kB
Mapped: 227857920 kB
Shmem: 227861632 kB
KReclaimable: 423680 kB
Slab: 2767232 kB
SReclaimable: 423680 kB
SUnreclaim: 2343552 kB
KernelStack: 93440 kB
ShadowCallStack: 121088 kB
PageTables: 40640 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 259467392 kB
Committed_AS: 231067456 kB
VmallocTotal: 137168158720 kB
VmallocUsed: 567680 kB
VmallocChunk: 0 kB
Percpu: 156672 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 524288 kB
CmaFree: 3072 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 524288 kB
Hugetlb: 0 kB
So while the number of ANON pages is much higher (due to how we set up the
reproducer), we can still cause the page allocation failures with enough
pressure on the LRU lists.
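
For reference, here is roughly what the reproducer does. This is only an
illustrative sketch based on the test plan description (the real
alloc_and_crash.c is attached to the LP bug, and the allocation size below
is an arbitrary example value):

    /* Illustrative sketch only, not the attached alloc_and_crash.c:
     * mmap a huge anonymous range, dirty every page with memset, then
     * force a SIGSEGV so the kernel writes a large core dump through the
     * ext4 buffered-write path shown in the stack trace above. */
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64UL << 30; /* example: 64 GiB, tune to the node size */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            memset(p, 0xaa, len);    /* touch every page */
            *(volatile int *)0 = 0;  /* forcibly SEGFAULT -> core dump */
            return 0;
    }

The attached lp2097214-repro.sh then membinds this to NUMA node 0 (as per
the test plan), which is where the allocation failures show up in dmesg.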
> Could you answer the rest of my questions in the previous email?
>
Sure!
I did use the scripts attached to the LP bug to reproduce it successfully,
with the caveats I mentioned previously (only on aarch64, and more easily
with 64k pages).
I landed on the aforementioned fix commit by bisecting the upstream kernel
(Linus' tree), and confirmed the issue no longer happens when cherry-picking
commit 1bc542c6a0d1 into the Ubuntu kernels. I've validated this for Noble,
Oracular, and Plucky.
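
For context, the gist of that commit (as I understand it) is that the MGLRU
eviction path now wakes the writeback flushers when the folios it takes
from the coldest generation are dominated by dirty file folios, instead of
stalling on pages it cannot reclaim until writeback catches up. A rough
sketch of the idea follows; it is not the literal upstream diff, and the
function/parameter names are made up for illustration (only
wakeup_flusher_threads() and WB_REASON_VMSCAN are real kernel interfaces):

    /* Conceptual sketch of commit 1bc542c6a0d1, not the actual diff.
     * Kernel context assumed; names below are illustrative. */
    #include <linux/writeback.h>

    static void sketch_maybe_wake_flushers(unsigned long file_taken,
                                           unsigned long unqueued_dirty)
    {
            /* file_taken:     file folios isolated from the coldest gen
             * unqueued_dirty: those that are dirty but not yet queued
             *                 for writeback */
            if (unqueued_dirty && unqueued_dirty == file_taken)
                    wakeup_flusher_threads(WB_REASON_VMSCAN);
    }

That matches the node stats in the failure above (dirty:15437568kB,
writeback:0kB): reclaim keeps hitting dirty file folios that nothing has
started writing back yet.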
Let me know if you need any more info on this!
> [...]
> > Also, did you confirm that the issue was resolved after applying the
> > patch for Noble/Oracular/Plucky? It seems to me that it's just stressing
> > lru list for ANON, not FILE.
>
> >
> > On Wed, Feb 12, 2025 at 1:37 AM Koichiro Den <koichiro.den at canonical.com>
> > wrote:
> >
> > > On Sun, Feb 02, 2025 at 12:21:50PM GMT, Heitor Alves de Siqueira wrote:
> > > > BugLink: https://bugs.launchpad.net/bugs/2097214
> > > >
> > > > [Impact]
> > > > * On MGLRU-enabled systems, high memory pressure on NUMA nodes will
> > > >   cause page allocation failures
> > > > * This happens due to page reclaim not waking up flusher threads
> > > > * OOM can be triggered even if the system has enough available memory
> > > >
> > > > [Test Plan]
> > > > * For the bug to properly trigger, we should uninstall apport and use
> > > >   the attached alloc_and_crash.c reproducer
> > > > * alloc_and_crash will mmap a huge range of memory, memset it and
> > > >   forcibly SEGFAULT
> > > > * The attached bash script will membind alloc_and_crash to NUMA node 0,
> > > >   so we can see the allocation failures in dmesg
> > > > $ sudo apt remove --purge apport
> > > > $ sudo dmesg -c; ./lp2097214-repro.sh; sleep 2; sudo dmesg
> > >
> > > I looked over the attached files (alloc_and_crash.c and
> > > lp2097214-repro.sh).
> > >
> > > Question:
> > > Did you use them to reproduce the issue that you want to resolve here?
> > > Also, did you confirm that the issue was resolved after applying the
> patch
> > > for Noble/Oracular/Plucky? It seems to me that it's just stressing lru
> > > list for ANON, not FILE.
> > >
> > > >
> > > > [Fix]
> > > > * The upstream patch wakes up flusher threads if there are too many
> > > >   dirty entries in the coldest LRU generation
> > > > * This happens when trying to shrink lruvecs, so reclaim only gets
> > > >   woken up during high memory pressure
> > > > * Fix was introduced by commit:
> > > >   1bc542c6a0d1 mm/vmscan: wake up flushers conditionally to avoid
> > > >   cgroup OOM
> > > >
> > > > [Regression Potential]
> > > > * This commit fixes the memory reclaim path, so regressions would
> > > >   likely show up during increased system memory pressure
> > > > * According to the upstream patch, increased SSD/disk wear is
> > > >   possible due to waking up flusher threads, although this has not
> > > >   been noted in testing
> > > >
> > > > Zeng Jingxiang (1):
> > > > mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
> > > >
> > > > mm/vmscan.c | 25 ++++++++++++++++++++++---
> > > > 1 file changed, 22 insertions(+), 3 deletions(-)
> > > >
> > > > --
> > > > 2.48.1
> > > >
> > > >
> > > > --
> > > > kernel-team mailing list
> > > > kernel-team at lists.ubuntu.com
> > > > https://lists.ubuntu.com/mailman/listinfo/kernel-team
> > >
>