Wed Dec 18 18:57:50 UTC 2019

BugLink: https://bugs.launchpad.net/bugs/1855409


* The PTP feature in the qede driver is implemented in such a way that, if
the NIC firmware takes some time to perform the timestamping, the PTP worker
function reschedules itself indefinitely until the value read from a
device register is meaningful. With that behavior, if a userspace tool
requests a badly configured TX/RX filter (or if the NIC firmware has any other
issue in timestamping), the function qede_ptp_task() will reschedule itself
forever and cause unbounded resource consumption. This manifests as a
kworker thread consuming 100% of CPU.

* The dmesg log will show a message like this:
"qede_ptp_tx_ts:533(eno3)]Timestamping in progress"

Also, using perf, the user can observe a stack like the following:
- 44.76% 0.00% kworker/16:5 [kernel.kallsyms]
   - kthread
      - 44.74% worker_thread
         - 44.57% process_one_work
            - 42.67% qede_ptp_task
               - 38.86% qed_ptp_hw_read_tx_ts
               - 3.03% queue_work_on
                  - 2.06% __queue_work
                     - 0.68% get_work_pool
                        - 0.61% radix_tree_lookup
              0.50% set_work_pool_and_clear_pending

* The patch proposed in this SRU request refactors the PTP worker in qede by
adding a time limit, after which the task no longer reschedules itself and
the timestamp procedure fails:
9adebac37e7d ("qede: Handle infinite driver spinning for Tx timestamp.")

Besides fixing the issue, it also adds an ethtool statistic to account for
the PTP errors.
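
For illustration only, the bounded-retry pattern described above can be
sketched in plain userspace C. This is not the driver code and does not
reproduce the upstream patch: struct ts_task, ts_task_poll(), ts_task_run()
and the tick-based clock are hypothetical names invented for this sketch,
and the hardware is simulated by a "ready_at" tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the qede driver. */
struct ts_task {
	uint64_t start;     /* tick at which the timestamp request was issued */
	uint64_t timeout;   /* maximum number of ticks to keep polling */
	uint64_t ready_at;  /* simulated tick at which the hardware has a
			     * timestamp; UINT64_MAX means "never" */
	unsigned ts_errors; /* error counter, analogous to a statistic */
};

enum ts_result { TS_DONE, TS_RETRY, TS_TIMED_OUT };

/* One invocation of the worker: succeed, ask to be rescheduled, or give up. */
static enum ts_result ts_task_poll(struct ts_task *t, uint64_t now)
{
	assert(t->timeout > 0);

	if (now >= t->ready_at)
		return TS_DONE;         /* timestamp is available */

	if (now - t->start >= t->timeout) {
		t->ts_errors++;         /* account the failure */
		return TS_TIMED_OUT;    /* do NOT reschedule again */
	}

	return TS_RETRY;                /* caller reschedules the worker */
}

/* Drive the worker the way a work queue would, one tick per invocation.
 * Returns the number of invocations; the final result is stored in *out. */
static unsigned ts_task_run(struct ts_task *t, enum ts_result *out)
{
	uint64_t now = t->start;
	unsigned polls = 0;
	enum ts_result r;

	do {
		r = ts_task_poll(t, now++);
		polls++;
	} while (r == TS_RETRY);

	*out = r;
	return polls;
}
```

Without the timeout branch, a task whose ready_at is never reached would
loop forever, which corresponds to the 100% CPU kworker described above.
With it, the loop ends after a bounded number of invocations and the
failure is counted.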

[Test case]

By using chrony in Bionic, the following steps will reproduce the issue:

a) Install chrony on Bionic in a system with a working NIC managed by qede;
b) Edit the chrony configuration and add "hwtimestamp *" to the top of the file;
c) Restart the chrony service.

Check dmesg for the "[...]Timestamping in progress" message and the
overall CPU workload using a tool like "top" to observe a kthread
consuming 100% of CPU.

[Regression potential]

The patch scope is restricted to the qede PTP handler, and the patch has been
upstream for more than 7 months. If there is any regression, the worst case
would be an issue affecting packet timestamping, without affecting the regular
xmit path of the driver.

Sudarsana Reddy Kalluru (1):
  qede: Handle infinite driver spinning for Tx timestamp.

 drivers/net/ethernet/qlogic/qede/qede.h       |  2 +
 .../net/ethernet/qlogic/qede/qede_ethtool.c   |  2 +
 drivers/net/ethernet/qlogic/qede/qede_main.c  |  4 ++
 drivers/net/ethernet/qlogic/qede/qede_ptp.c   | 37 +++++++++++++++----
 4 files changed, 38 insertions(+), 7 deletions(-)


