APPLIED: [SRU][Disco][PATCH 0/1] NFSv4.1: Interrupted connections cause high bandwidth RPC ping-pong between client and server

Fri Nov 8 17:32:18 UTC 2019

On 2019-10-30 12:38:14 , Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1828978
> 
> [Impact]
> 
> There is a bug in NFS v4.1 that causes a large number of RPC calls between a 
> client and server when a previous RPC call is interrupted. This uses a large 
> amount of bandwidth and can saturate the network.
> 
> The symptoms are as follows:
> 
> * On NFS clients:
> Attempts to access mounted NFS shares associated with the affected server block
> indefinitely.
>  
> * On the network:
> A storm of repeated RPCs between NFS client and server uses a lot of bandwidth.
> Each RPC is acknowledged by the server with an NFS4ERR_SEQ_MISORDERED error.
> 
> * Other NFS clients connected to the same NFS server:
> Performance drops dramatically.
> 
> This occurs during a "false retry", when a client attempts to make a new RPC 
> call using a slot+sequence number that references an older, cached call. This 
> happens when a user process interrupts an RPC call that is in progress.
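
For readers unfamiliar with NFSv4.1 sessions, the slot check described above can be sketched roughly as follows. This is a minimal illustration based on the slot sequencing rules in RFC 5661, not code from any real server; the enum, the function name check_slot_seqid() and the matches_cached_request flag are all hypothetical names introduced for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts for this sketch; not the kernel's or any server's definitions. */
enum slot_verdict {
    SLOT_NEW_REQUEST,    /* seqid is cached + 1: execute and cache the reply */
    SLOT_REPLAY,         /* seqid equals cached and the request matches: resend cached reply */
    SLOT_FALSE_RETRY,    /* seqid equals cached but the request differs: "false retry" */
    SLOT_SEQ_MISORDERED  /* any other seqid: NFS4ERR_SEQ_MISORDERED */
};

/*
 * Classify an incoming SEQUENCE request for one slot.
 * cached_seqid is the sequence number of the last request the server
 * executed on this slot. matches_cached_request is an assumption of this
 * sketch: it stands in for whatever comparison a server uses to decide
 * whether the request is the same one it has cached.
 */
enum slot_verdict check_slot_seqid(uint32_t cached_seqid, uint32_t seqid,
                                   bool matches_cached_request)
{
    /* Sequence numbers are 32-bit and wrap around. */
    if (seqid == (uint32_t)(cached_seqid + 1))
        return SLOT_NEW_REQUEST;
    if (seqid == cached_seqid)
        return matches_cached_request ? SLOT_REPLAY : SLOT_FALSE_RETRY;
    return SLOT_SEQ_MISORDERED;
}
```

In this model, a client that reuses a slot and sequence number for a different call after an interrupt lands in the SLOT_FALSE_RETRY case, and a client whose sequence number has drifted from the server's view keeps receiving SLOT_SEQ_MISORDERED.
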
> 
> [Fix]
> 
> This was fixed in 5.1 upstream with the below commit:
> 
> commit 3453d5708b33efe76f40eca1c0ed60923094b971
> Author: Trond Myklebust <trond.myklebust at hammerspace.com>
> Date:   Wed Jun 20 17:53:34 2018 -0400
> Subject: NFSv4.1: Avoid false retries when RPC calls are interrupted
> 
> The fix is to pre-emptively increment the sequence number if an RPC call is 
> interrupted. To address corner cases, the client interprets the 
> NFS4ERR_SEQ_MISORDERED error as a sign that it needs to locate an appropriate 
> sequence number between the value it sent and the last successfully acked 
> SEQUENCE call.
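
The client-side idea in the paragraph above can be sketched as below. This is a simplified model under stated assumptions, not the code from the upstream commit: struct slot_sketch and the three helper functions are hypothetical, and the field names are only loosely modelled on the kernel's slot bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-slot state for this sketch. */
struct slot_sketch {
    uint32_t seq_nr;            /* sequence number the next call will send */
    uint32_t seq_nr_last_acked; /* last sequence number the server acknowledged */
};

/*
 * A call on this slot was interrupted before its reply arrived.
 * The server may or may not have seen it, so bump the sequence number
 * pre-emptively to avoid reusing it for a different call.
 */
void slot_on_interrupt(struct slot_sketch *s)
{
    s->seq_nr++;
}

/* The server replied successfully to the call sent with s->seq_nr. */
void slot_on_ack(struct slot_sketch *s)
{
    s->seq_nr_last_acked = s->seq_nr;
    s->seq_nr++;
}

/*
 * The server answered NFS4ERR_SEQ_MISORDERED. If interrupts pushed
 * seq_nr more than one past the last acked value, the server may never
 * have received the interrupted calls, so step back by one and tell the
 * caller to retry. Returns false when there is no room left to search.
 */
bool slot_on_misordered(struct slot_sketch *s)
{
    if ((int32_t)(s->seq_nr - s->seq_nr_last_acked) > 1) {
        s->seq_nr--;
        return true;
    }
    return false;
}
```

With last acked 5 and two interrupted calls, the sketch would send 8, step back to 7 and then 6 on successive NFS4ERR_SEQ_MISORDERED replies, and stop searching once it reaches the value directly after the last acked one.
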
> 
> Commit 3453d5708b33efe76f40eca1c0ed60923094b971 is a clean cherry-pick to disco.
> 
> [Testcase]
> 
> This is difficult to reproduce on test systems, and has instead been verified on
> a production NFS v4.1 system in a customer environment. This server is heavily 
> trafficked and has a large number of different NFS clients connected to it.
> 
> I have built a test kernel that contains the above patch, and also patches for 
> Bug 1842037. It is available here:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf241068-test
> 
> Note that the above kernel is for bionic HWE, and not explicitly disco.
> 
> Discussion about the patch validation can be found at the bottom of Bug 1842037.
> 
> On unpatched kernels, expect to see the symptoms mentioned in Impact. On 
> patched kernels, everything should work as intended.
> 
> [Regression Potential]
> 
> The changes are localised to NFS v4.1 only, and other versions of NFS are not 
> affected. If a regression occurs, users can downgrade NFS versions to v4.0 or 
> v3.x until a fix is made.
> 
> The changed code paths are only exercised when connections are interrupted, and
> would not be invoked under typical blue-sky scenarios.
> 
> There have been no fixup commits or commits near the requested commit in newer 
> kernels, which suggests that this commit fixes the issue and has been accepted 
> by the community.
> 
> Trond Myklebust (1):
>   NFSv4.1: Avoid false retries when RPC calls are interrupted
> 
>  fs/nfs/nfs4proc.c    | 105 ++++++++++++++++++++-----------------------
>  fs/nfs/nfs4session.c |   5 ++-
>  fs/nfs/nfs4session.h |   5 ++-
>  3 files changed, 55 insertions(+), 60 deletions(-)
> 
> -- 
> 2.20.1
> 