[Bug 2083502] Re: NFS4.2 crashes with high IO load after some hours

Peter Schubert 2083502 at bugs.launchpad.net
Tue Nov 5 07:59:39 UTC 2024


*** This bug is a duplicate of bug 2062568 ***
    https://bugs.launchpad.net/bugs/2062568

For everyone who has the same problem and absolutely needs to use
NFS4.1 or NFS4.2, there seems to be a simple workaround: if both the NFS
server and the NFS client run the current 5.15 kernel, the crashes have
not occurred so far.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to nfs-utils in Ubuntu.
https://bugs.launchpad.net/bugs/2083502

Title:
  NFS4.2 crashes with high IO load after some hours

Status in nfs-utils package in Ubuntu:
  New

Bug description:
  Since August 19th we have been struggling with irregular crashes on
  our NFS server.

  Our experiences with NFS server crashes are:
  - We were able to reproduce the crashes in our production and test environments with NFS4.2: rarely after minutes, sometimes after hours or days, and sometimes not at all,
    since we often stopped the experiments after 12 to 24 hours.
  - We have not yet been able to reproduce a crash between a bare metal NFS server and a bare metal NFS client, but we have reproduced one between a bare metal NFS server and a virtualized client with NFS4.2.
  - So far we could not reproduce a crash with NFS vers=4.0.
  - We have now been running NFS vers=4.1 for some hours to see whether this makes the system stable.
  - The crashes happen with or without GSSPROXY.
  - Before Sept 15 the kernel backtrace on the NFS server always started with:
   watchdog: BUG: soft lockup - CPU#23 stuck for 26s! [kworker/u483:0:8805]
   and after Sept 15 with:
   rcu: INFO: rcu_sched self-detected stall on CPU
  - Changing the kernel on the client from 6.8.0-40-generic to the unofficial 6.5.0-46-generic from Mehmet Basaran (mehmetbasaran) only removed the backchannel error from the client;
    the server still hangs with "rcu: INFO: rcu_sched self-detected stall on CPU"
  - Changing the client kernel from 6.8.0-40-generic back to 6.5.0-44-generic did not solve the problem that users cannot log in after the crash:
   client:
    kernel: [29361.795714] INFO: task python3:107226 blocked for more than 120 seconds.
  - Also changing the server kernel back from 5.15.0-122-generic to 5.15.0-112-generic only changed the error message, not the stability:
   server:
    kernel: [82884.774039] INFO: task split:8351 blocked for more than 120 seconds.
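
The three kernel messages quoted in the list above (soft lockup, RCU stall, hung task) can be told apart mechanically when going through long logs. The following is only a minimal sketch: the regular expressions are derived from the message texts quoted in this report, and the names `SIGNATURES`, `classify` and `count_signatures` are hypothetical helpers, not part of any existing tool.

```python
import re

# Patterns modeled on the messages quoted in this report (assumption:
# other kernel versions may word these messages differently).
SIGNATURES = {
    "soft_lockup": re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s"),
    "rcu_stall": re.compile(r"rcu: INFO: rcu_sched self-detected stall on CPU"),
    "hung_task": re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
}


def classify(line):
    """Return the name of the first signature found in the line, or None."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(line):
            return name
    return None


def count_signatures(lines):
    """Count how many lines match each known signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts
```

Feeding it the lines of a saved kernel log (for example `count_signatures(open(path))`, with `path` pointing at a log file of your choice) gives a rough overview of which kind of hang dominated before and after a kernel change.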

  Our setup:
  - virtualized NFS 4.2 server with Ubuntu 22.04.5 LTS and kernel 5.15.0-122-generic
  - virtualized NFS clients with Ubuntu 22.04.5 LTS and kernel 6.8.0-40-generic or kernel 6.8.0-45-generic
  - /etc/exports :  /mnt/home  nfsclient(sec=krb5,rw,root_squash,sync,no_subtree_check)
  - /etc/fstab :  nfsserver:/mnt/home /home   nfs    vers=4.2,rw,soft,sec=krb5,proto=tcp  0  0
  - apt info nfs-common : Version: 1:2.6.1-1ubuntu1.2
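
Since the behaviour reported here depends on the NFS version in use, it can help to confirm which `vers=` value a mount really carries. The sketch below assumes the whitespace-separated layout shared by /etc/fstab and /proc/mounts (device, mount point, type, options); `nfs_mount_versions` is a hypothetical helper name and the sample line is modeled on the fstab entry above.

```python
def nfs_mount_versions(mounts_text):
    """Map each NFS mount point to the value of its vers= option (None if absent)."""
    versions = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect at least: device, mount point, filesystem type, options.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        vers = None
        for opt in fields[3].split(","):
            if opt.startswith("vers="):
                vers = opt[len("vers="):]
        versions[fields[1]] = vers
    return versions


# Sample modeled on the fstab entry in this report.
SAMPLE = "nfsserver:/mnt/home /home   nfs    vers=4.2,rw,soft,sec=krb5,proto=tcp  0  0\n"

if __name__ == "__main__":
    print(nfs_mount_versions(SAMPLE))
```

On a running client the same function could be given the contents of /proc/mounts instead of the sample string, which shows the options of the live mount rather than the ones requested in fstab.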

  # NFS Server error message with mainline kernel 5.15.0-122-generic :
  Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: INFO: rcu_sched self-detected stall on CPU
  Sep 30 01:15:51 nfs-server.domain.de kernel: rcu:         54-....: (14998 ticks this GP) idle=2db/1/0x4000000000000000 softirq=32173387/32173387 fqs=7449
  Sep 30 01:15:51 nfs-server.domain.de kernel:         (t=15000 jiffies g=144775177 q=49782)
  Sep 30 01:15:51 nfs-server.domain.de kernel: NMI backtrace for cpu 54
  Sep 30 01:15:51 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
  Sep 30 01:15:51 nfs-server.domain.de kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
  Sep 30 01:15:51 nfs-server.domain.de kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
  Sep 30 01:15:51 nfs-server.domain.de kernel: Call Trace:
  Sep 30 01:15:51 nfs-server.domain.de kernel:  <IRQ>
  Sep 30 01:15:51 nfs-server.domain.de kernel:  show_stack+0x52/0x5c
  Sep 30 01:15:51 nfs-server.domain.de kernel:  dump_stack_lvl+0x4a/0x63
  Sep 30 01:15:51 nfs-server.domain.de kernel:  dump_stack+0x10/0x16
  Sep 30 01:15:51 nfs-server.domain.de kernel:  nmi_cpu_backtrace.cold+0x4d/0x93
  Sep 30 01:15:51 nfs-server.domain.de kernel:  ? lapic_can_unplug_cpu+0x90/0x90
  Sep 30 01:15:51 nfs-server.domain.de kernel:  nmi_trigger_cpumask_backtrace+0xec/0x100
  Sep 30 01:15:51 nfs-server.domain.de kernel:  arch_trigger_cpumask_backtrace+0x19/0x20
  Sep 30 01:15:51 nfs-server.domain.de kernel:  trigger_single_cpu_backtrace+0x44/0x4f
  Sep 30 01:15:51 nfs-server.domain.de kernel:  rcu_dump_cpu_stacks+0x102/0x149
  Sep 30 01:15:51 nfs-server.domain.de kernel:  print_cpu_stall.cold+0x2f/0xe2
  Sep 30 01:15:51 nfs-server.domain.de kernel:  check_cpu_stall+0x1d8/0x270
  Sep 30 01:15:51 nfs-server.domain.de kernel:  rcu_sched_clock_irq+0x9a/0x250
  Sep 30 01:15:51 nfs-server.domain.de kernel:  update_process_times+0x94/0xd0
  Sep 30 01:15:51 nfs-server.domain.de kernel:  tick_sched_handle+0x29/0x70
  Sep 30 01:15:51 nfs-server.domain.de kernel:  tick_sched_timer+0x6f/0x90
  Sep 30 01:15:51 nfs-server.domain.de kernel:  ? tick_sched_do_timer+0xa0/0xa0
  Sep 30 01:15:51 nfs-server.domain.de kernel:  __hrtimer_run_queues+0x104/0x230
  Sep 30 01:15:51 nfs-server.domain.de kernel:  ? read_hv_clock_tsc_cs+0x9/0x30
  Sep 30 01:15:51 nfs-server.domain.de kernel:  hrtimer_interrupt+0x101/0x220
  Sep 30 01:15:51 nfs-server.domain.de kernel:  hv_stimer0_isr+0x1d/0x30
  Sep 30 01:15:51 nfs-server.domain.de kernel:  __sysvec_hyperv_stimer0+0x2f/0x70
  Sep 30 01:15:51 nfs-server.domain.de kernel:  sysvec_hyperv_stimer0+0x7b/0x90
  Sep 30 01:15:51 nfs-server.domain.de kernel:  </IRQ>
  Sep 30 01:15:51 nfs-server.domain.de kernel:  <TASK>
  Sep 30 01:15:51 nfs-server.domain.de kernel:  asm_sysvec_hyperv_stimer0+0x1b/0x20
  Sep 30 01:15:51 nfs-server.domain.de kernel: RIP: 0010:read_hv_clock_tsc+0x1b/0x6
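
When comparing several such traces, it can be convenient to reduce each call-trace line to its symbol name. The following is a rough sketch that assumes the `symbol+0xoffset/0xsize` frame format shown in the log above; `FRAME` and `parse_frame` are hypothetical names. Frames prefixed with `?` are those the kernel's stack unwinder could not verify as part of the call chain, so they are flagged as not reliable.

```python
import re

# Matches journal lines ending in a frame such as "show_stack+0x52/0x5c",
# optionally prefixed with "? " (assumption: journal-style "kernel:" prefix).
FRAME = re.compile(r"kernel:\s+(\?\s+)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)\s*$")


def parse_frame(line):
    """Return a dict describing one call-trace frame, or None for other lines."""
    m = FRAME.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group(2),
        "offset": int(m.group(3), 16),
        "size": int(m.group(4), 16),
        "reliable": m.group(1) is None,
    }
```

Lines such as `Call Trace:`, `<IRQ>` or the `RIP:` line do not match the pattern and yield `None`, so they can simply be skipped when collecting frames.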

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2083502/+subscriptions



