<div dir="ltr"><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Wed, Feb 12, 2025 at 9:36 AM Koichiro Den <<a href="mailto:koichiro.den@canonical.com">koichiro.den@canonical.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, Feb 12, 2025 at 09:14:27AM GMT, Heitor Alves de Siqueira wrote:<br>
> Hi Koichiro,<br>
> <br>
> thanks for looking into this! Yes, I've used the attached scripts to<br>
> reproduce the issue successfully, although only in aarch64 systems<br>
> (specifically, I've used Grace-Grace for my tests).<br>
> I've not been able to reproduce this reliably in x86 or other<br>
> architectures, and using 64k page sizes also makes this much faster/easier<br>
> to reproduce.<br>
<br>
Thanks for the reply. Just let me confirm; when you verified that you<br>
reproduced it, you confirmed that there were large number of dirty folios<br>
in the LRU list for the coldest gen for FILE (not ANON), right?</blockquote><div> </div><div>Here's a stack trace from the latest reproducer run I did earlier this morning, using kernel 6.8.0-53-generic-64k from Noble:</div><div><br></div><div>[ 124.550628] alloc_and_crash: page allocation failure: order:0, mode:0x141cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_WRITE), nodemask=0,cpuset=/,mems_allowed=0-1<br>[ 124.550648] CPU: 135 PID: 3406 Comm: alloc_and_crash Not tainted 6.8.0-53-generic-64k #55-Ubuntu<br>[ 124.550651] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c 12/28/2023<br>[ 124.550653] Call trace:<br>[ 124.550656] dump_backtrace+0xa4/0x150<br>[ 124.550665] show_stack+0x24/0x50<br>[ 124.550667] dump_stack_lvl+0xc8/0x138<br>[ 124.550671] dump_stack+0x1c/0x38<br>[ 124.550672] warn_alloc+0x16c/0x1f0<br>[ 124.550677] __alloc_pages_slowpath.constprop.0+0x8e4/0x9f0<br>[ 124.550679] __alloc_pages+0x2f0/0x3a8<br>[ 124.550680] alloc_pages_mpol+0x94/0x290<br>[ 124.550685] alloc_pages+0x6c/0x118<br>[ 124.550687] folio_alloc+0x24/0x98<br>[ 124.550689] filemap_alloc_folio+0x168/0x188<br>[ 124.550692] __filemap_get_folio+0x1bc/0x3f8<br>[ 124.550694] ext4_da_write_begin+0x144/0x300<br>[ 124.550697] generic_perform_write+0xc4/0x228<br>[ 124.550699] ext4_buffered_write_iter+0x78/0x180<br>[ 124.550701] ext4_file_write_iter+0x44/0xf0<br>[ 124.550702] __kernel_write_iter+0x10c/0x2c0<br>[ 124.550704] dump_user_range+0xe0/0x240<br>[ 124.550707] elf_core_dump+0x4cc/0x538<br>[ 124.550709] do_coredump+0x574/0x988<br>[ 124.550711] get_signal+0x7dc/0x8f0<br>[ 124.550713] do_signal+0x138/0x1f8<br>[ 124.550715] do_notify_resume+0x114/0x298<br>[ 124.550716] el0_da+0xdc/0x178<br>[ 124.550719] el0t_64_sync_handler+0xdc/0x158<br>[ 124.550721] el0t_64_sync+0x1b0/0x1b8<br>[ 124.550723] Mem-Info:<br>[ 124.550728] active_anon:3921 inactive_anon:3473262 isolated_anon:0<br> active_file:933 inactive_file:252531 isolated_file:0<br> unevictable:609 dirty:241262 
writeback:0<br> slab_reclaimable:9234 slab_unreclaimable:35922<br> mapped:3472425 shmem:3474488 pagetables:624<br> sec_pagetables:0 bounce:0<br> kernel_misc_reclaimable:0<br> free:4031494 free_pcp:0 free_cma:48<br>[ 124.550733] Node 0 active_anon:206656kB inactive_anon:222288768kB active_file:1728kB inactive_file:15437504kB unevictable:9024kB isolated(anon):0kB isolated(file):0kB mapped:222210880kB dirty:15437568kB writeback:0kB shmem:222337216kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB kernel_stack:51584kB shadow_call_stack:66368kB pagetables:38016kB sec_pagetables:0kB all_unreclaimable? yes<br>[ 124.550738] Node 0 DMA free:1041984kB boost:0kB min:69888kB low:87360kB high:104832kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:393472kB unevictable:0kB writepending:394112kB present:2097152kB managed:2029632kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:3072kB<br>[ 124.550742] lowmem_reserve[]: 0 0 15189 15189 15189<br>[ 124.550747] Node 0 Normal free:8574848kB boost:0kB min:8575808kB low:10719744kB high:12863680kB reserved_highatomic:0KB active_anon:206656kB inactive_anon:222288768kB active_file:1728kB inactive_file:15044032kB unevictable:9024kB writepending:15043456kB present:249244544kB managed:248932800kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB<br>[ 124.550750] lowmem_reserve[]: 0 0 0 0 0<br>[ 124.550754] Node 0 DMA: 5*64kB (ME) 4*128kB (ME) 1*256kB (U) 7*512kB (UE) 5*1024kB (UMEC) 2*2048kB (UC) 3*4096kB (UME) 2*8192kB (ME) 3*16384kB (UME) 3*32768kB (UME) 1*65536kB (U) 2*131072kB (UE) 2*262144kB (UE) 0*524288kB = 1041984kB<br>[ 124.550769] Node 0 Normal: 726*64kB (UME) 392*128kB (UME) 246*256kB (UE) 138*512kB (UME) 65*1024kB (UE) 48*2048kB (UME) 19*4096kB (UE) 7*8192kB (UME) 5*16384kB (U) 3*32768kB (UM) 2*65536kB (ME) 1*131072kB (E) 1*262144kB (M) 14*524288kB (M) = 8574848kB<br>[ 124.550786] Node 0 hugepages_total=0 
hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB<br>[ 124.550788] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB<br>[ 124.550789] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB<br>[ 124.550790] 3729522 total pagecache pages<br>[ 124.550792] 1406 pages in swap cache<br>[ 124.550793] Free swap = 0kB<br>[ 124.550794] Total swap = 8388544kB<br>[ 124.550795] 7858556 pages RAM<br>[ 124.550796] 0 pages HighMem/MovableOnly<br>[ 124.550796] 12342 pages reserved<br>[ 124.550797] 8192 pages cma reserved<br>[ 124.550798] 0 pages hwpoisoned<br></div><div><br></div><div>And here's /proc/meminfo from just before the crash:</div><div>MemTotal: 502157696 kB<br>MemFree: 258273600 kB<br>MemAvailable: 236229312 kB<br>Buffers: 29632 kB<br>Cached: 237187456 kB<br>SwapCached: 1374848 kB<br>Active: 9878912 kB<br>Inactive: 228723776 kB<br>Active(anon): 1307520 kB<br>Inactive(anon): 227959296 kB<br>Active(file): 8571392 kB<br>Inactive(file): 764480 kB<br>Unevictable: 38976 kB<br>Mlocked: 29952 kB<br>SwapTotal: 8388544 kB<br>SwapFree: 5436224 kB<br>Zswap: 0 kB<br>Zswapped: 0 kB<br>Dirty: 8519168 kB<br>Writeback: 1250368 kB<br>AnonPages: 79424 kB<br>Mapped: 227857920 kB<br>Shmem: 227861632 kB<br>KReclaimable: 423680 kB<br>Slab: 2767232 kB<br>SReclaimable: 423680 kB<br>SUnreclaim: 2343552 kB<br>KernelStack: 93440 kB<br>ShadowCallStack: 121088 kB<br>PageTables: 40640 kB<br>SecPageTables: 0 kB<br>NFS_Unstable: 0 kB<br>Bounce: 0 kB<br>WritebackTmp: 0 kB<br>CommitLimit: 259467392 kB<br>Committed_AS: 231067456 kB<br>VmallocTotal: 137168158720 kB<br>VmallocUsed: 567680 kB<br>VmallocChunk: 0 kB<br>Percpu: 156672 kB<br>HardwareCorrupted: 0 kB<br>AnonHugePages: 0 kB<br>ShmemHugePages: 0 kB<br>ShmemPmdMapped: 0 kB<br>FileHugePages: 0 kB<br>FilePmdMapped: 0 kB<br>CmaTotal: 524288 kB<br>CmaFree: 3072 kB<br>HugePages_Total: 0<br>HugePages_Free: 0<br>HugePages_Rsvd: 0<br>HugePages_Surp: 0<br>Hugepagesize: 524288 kB<br>Hugetlb: 
0 kB</div><div><br></div><div>So while the number of ANON pages is much higher (due to how we set up the reproducer), we can still trigger the page allocation failures with enough pressure on the LRU lists.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Could you answer the rest of my questions in the previous email?<br></blockquote><div><br></div><div>Sure!</div><div>I did use the scripts attached to the LP bug to reproduce it successfully, with the caveats I mentioned previously (only on aarch64, and more easily with 64k pages).</div><div>I arrived at the fix commit by bisecting the upstream kernel (Linus' tree), and confirmed that the issue no longer occurs when cherry-picking commit 1bc542c6a0d1 into Ubuntu kernels. I've validated this for Noble, Oracular and Plucky.</div><div><br></div><div>Let me know if you need any more info on this!</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> [...]<br>
> Also, did you confirm that the issue was resolved after applying the patch<br>
> for Noble/Oracular/Plucky? It seems to me that it's just stressing lru<br>
> list for ANON, not FILE.<br>
<br>
> <br>
> On Wed, Feb 12, 2025 at 1:37 AM Koichiro Den <<a href="mailto:koichiro.den@canonical.com" target="_blank">koichiro.den@canonical.com</a>><br>
> wrote:<br>
> <br>
> > On Sun, Feb 02, 2025 at 12:21:50PM GMT, Heitor Alves de Siqueira wrote:<br>
> > > BugLink: <a href="https://bugs.launchpad.net/bugs/2097214" rel="noreferrer" target="_blank">https://bugs.launchpad.net/bugs/2097214</a><br>
> > ><br>
> > > [Impact]<br>
> > > * On MGLRU-enabled systems, high memory pressure on NUMA nodes will<br>
> > cause page<br>
> > > allocation failures<br>
> > > * This happens due to page reclaim not waking up flusher threads<br>
> > > * OOM can be triggered even if the system has enough available memory<br>
> > ><br>
> > > [Test Plan]<br>
> > > * For the bug to properly trigger, we should uninstall apport and use<br>
> > the<br>
> > > attached alloc_and_crash.c reproducer<br>
> > > * alloc_and_crash will mmap a huge range of memory, memset it and<br>
> > forcibly SEGFAULT<br>
> > > * The attached bash script will membind alloc_and_crash to NUMA node 0,<br>
> > so we<br>
> > > can see the allocation failures in dmesg<br>
> > > $ sudo apt remove --purge apport<br>
> > > $ sudo dmesg -c; ./lp2097214-repro.sh; sleep 2; sudo dmesg<br>
> ><br>
> > I looked over the attached files (alloc_and_crash.c and<br>
> > lp2097214-repro.sh).<br>
> ><br>
> > Question:<br>
> > Did you use them to reproduce the issue that you want to resolve here?<br>
> > Also, did you confirm that the issue was resolved after applying the patch<br>
> > for Noble/Oracular/Plucky? It seems to me that it's just stressing lru<br>
> > list for ANON, not FILE.<br>
> ><br>
> > ><br>
> > > [Fix]<br>
> > > * The upstream patch wakes up flusher threads if there are too many<br>
> > dirty<br>
> > > entries in the coldest LRU generation<br>
> > > * This happens when trying to shrink lruvecs, so reclaim only gets<br>
> > woken up<br>
> > > during high memory pressure<br>
> > > * Fix was introduced by commit:<br>
> > > 1bc542c6a0d1 mm/vmscan: wake up flushers conditionally to avoid<br>
> > cgroup OOM<br>
> > ><br>
> > > [Regression Potential]<br>
> > > * This commit fixes the memory reclaim path, so regressions would<br>
> > likely show<br>
> > > up during increased system memory pressure<br>
> > > * According to the upstream patch, increased SSD/disk wearing is<br>
> > possible due<br>
> > > to waking up flusher threads, although these have not been noted in<br>
> > testing<br>
> > ><br>
> > > Zeng Jingxiang (1):<br>
> > > mm/vmscan: wake up flushers conditionally to avoid cgroup OOM<br>
> > ><br>
> > > mm/vmscan.c | 25 ++++++++++++++++++++++---<br>
> > > 1 file changed, 22 insertions(+), 3 deletions(-)<br>
> > ><br>
> > > --<br>
> > > 2.48.1<br>
> > ><br>
> > ><br>
> > > --<br>
> > > kernel-team mailing list<br>
> > > <a href="mailto:kernel-team@lists.ubuntu.com" target="_blank">kernel-team@lists.ubuntu.com</a><br>
> > > <a href="https://lists.ubuntu.com/mailman/listinfo/kernel-team" rel="noreferrer" target="_blank">https://lists.ubuntu.com/mailman/listinfo/kernel-team</a><br>
> ><br>
</blockquote></div></div>