[SRU][F:linux-bluefield][PATCH v1 0/1] UBUNTU: SAUCE: Fix OOB handling RX packets in heavy traffic
Asmaa Mnebhi
asmaa at nvidia.com
Tue Mar 15 16:35:23 UTC 2022
BugLink: https://bugs.launchpad.net/bugs/1964984
SRU Justification:
[Impact]
The issue is reproducible on systems that already carry heavy background
traffic, when the user additionally issues one of the two docker pulls below:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa
The second one is a very large container (17GB).
When the docker pull runs, the OOB interface stops responding to ping,
and the docker pull either stalls for a very long time (more than
3 minutes) or times out.
[Fix]
* Update the RX_CQE_CI before updating the RX_PI to avoid a race condition in which we wrongly inform HW that there is space for the WQE.
* Disable the RX DMA while we are handling incoming packets to avoid overflow.
[Test Case]
* Created a script which loops 200 times and performs one of the following docker pulls in each iteration:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa
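The original script is not included in this message. A minimal sketch of such a loop might look like the following; the function name pull_loop, its arguments, and the failure summary are illustrative assumptions, not taken from the actual test script.

```shell
#!/bin/sh
# Sketch of a reproduction loop. The function name and its arguments are
# hypothetical; the original test script is not shown in this message.

# pull_loop COUNT CMD [ARGS...]
# Runs CMD with ARGS COUNT times, logs each failed run to stderr,
# prints a summary, and returns non-zero if any run failed.
pull_loop() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$@"; then
            failures=$((failures + 1))
            echo "iteration $i: command failed" >&2
        fi
        i=$((i + 1))
    done
    echo "completed $count iterations, $failures failures"
    [ "$failures" -eq 0 ]
}

# Example invocation (requires docker and network access):
#   pull_loop 200 docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
```

Whether the image is removed between iterations is not stated here; if a cached image makes later pulls trivial, a removal step before each pull may be needed to keep the traffic heavy.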
[Regression Potential]
* This could result in slower RX packet handling since we are disabling/enabling the DMA periodically.
* Although this fix has been tested by the people who opened the bug, QA needs to thoroughly test it to make sure the issue is no longer reproducible.