NFS Performance Problems, Ubuntu 20.04 / 22.04

Thu Nov 24 20:30:31 UTC 2022

Hello,

I have been investigating a NFS Server Performance Problem the past few days and can not make any sense of it at the moment.

==== Our setup is roughly as follows:
Storage Server [1,2] --> Storage Switch --> Cluster Node --> NFS Server VM --> NFS Client (VM on same host or different physical Device connected over LAN that is not the Storage network)
Storage Server 1 where VM root disks come from.
Storage Server 2 is where user data lives.
The storage server provides iscsi to the cluster node.
The clusternode uses this as PV for LVM, has a LV on top of it which is encrypted using luks.
The NFS Client VM has the resulting blockdevice passed in, has it formatted as ext4 and provides multiple directories originating from it to the network.

==== The performance Problem manifests as follows:
1)
When running a "time du -shx" on the NFS Server VM directly on the mounted storage block device, it finished in sub-second time.
Doing the same over NFS, be it either another physical server on the network, a VM on the same host, or even the NFS VM mounting the directory itself, the same operation takes around roughly 30 seconds.
(The directory is about 4GB in space and contains around 100k files, other directories have been tested and exhibit the same problems)
(I later also tested a directory created as "mkdir testdir{1..10}; touch testdir{1..10}/testfile{1..1000}", sub-second when accessed directly, 4 seconds when accessed over NFS).
2)
It seems i get at most 30MB/s throughput out of the NFS Server where I would expect around a 100MB/s.
The network itself seems to be able to handle 1Gbit/s find according to iperf3 testing.

==== I tried to diagnose the problem:
I have read several "NFS Performance Tuning" guides, played around with parameters such as rsize,wsize,no_subtree_check,root_squash,sync/async,...
None of it seemed to have major impact on performance.
According to "dstat" there is a lot of IO wait on the NFS Server, while nothing particular wierd is visible on NFS Clients, Cluster Node or Storage Server.

Since the NFS Server VM is somewhat in production use, i replicated the setup in a second VM, started with Ubuntu 20.04 and upgraded that to Ubuntu 22.04.
In all cases the performance issue has not changed.

I tried "systemctl stop apparmor", "aa-teardown" or even extending the kernel command line with "apparmor=0", with no noticeable difference.

I tested that the problem persists regardless if the underlying storage comes from a different storage server (Storage Server 1 is a HP P2000, Storage Server 2 is a Synology Rack Station).
With the test NFS Server serving from either Storage Server, the problem persists.
On recommendation of a colleague, i tested taking the storage servers out the calculation, so i used a qcow2 formated image stored in a tmpfs, passed it to the VM, formatted it as ext4 inside and used it as NFS export, same problem.

After some time when cross-testing, I noticed that a NFS Server running on a Debian 11 serving from Storage Server 1 does not exhibit the problem, so I currently suspect an Ubuntu specific problem.

Any suggestions where to look for the issue specifically or how to solve it?
If there is a better place to ask this, please direct me towards it.

thanks in advance
Theodor