[X/B][PATCH 0/2] Improve TSC refinement (and calibration) reliability

Guilherme G. Piccoli gpiccoli at canonical.com
Sun May 10 16:24:59 UTC 2020


BugLink: https://bugs.launchpad.net/bugs/1877858

[Impact]
* We recently received a report of TSC refinement going missing across multiple
reboots of a server with an Intel Skylake-based processor. This was only
reproducible on Bionic with pre-5.0 kernels.

* After checking kernel commits, we found 2 commits that greatly improve
the situation: a786ef152cdc ("x86/tsc: Make calibration refinement more
robust") [git.kernel.org/linus/a786ef152cdc] and 604dc9170f24 ("x86/tsc: Use
CPUID.0x16 to calculate missing crystal frequency")
[git.kernel.org/linus/604dc9170f24]. We hereby request SRU for both of them.

* The first commit improves some comments and adjusts an offset to better match
more recent (faster) machines, but the important part is a retry mechanism in
the TSC refinement (in case it fails due to some disturbance during the TSC
read, like NMIs/SMIs).

* The second commit is an improvement in TSC calibration for Skylake (and some
other models), by checking a register instead of relying on table-based
hardcoded values.

* A note for Xenial (kernel 4.4): the second patch would require the inclusion
of more commits, so given the "maturity" of this release (and the fact that
kernel 4.15 is available as an HWE kernel for Xenial), I've kept it out of
Xenial, backporting only the first and more important patch for 4.4.
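The retry idea behind the first commit can be illustrated with a small
user-space sketch. This is not the code from a786ef152cdc: all names below
(sample_once, sample_with_retry, fake_clock, ...) are hypothetical, the counter
sources are injected as function pointers so the example runs anywhere, and a
simple bounded loop stands in for whatever rescheduling the kernel does. It only
shows the principle: bracket the reference read with two TSC reads, reject the
sample if the two TSC reads are too far apart, and try again instead of giving
up on the first disturbed read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter source, injected so the sketch is testable. */
typedef uint64_t (*read_counter_fn)(void *ctx);

struct ref_sample {
    uint64_t tsc;  /* TSC value taken right after the reference read */
    uint64_t ref;  /* reference counter value */
};

/*
 * One attempt: read TSC, read the reference counter, read TSC again.
 * If the two TSC reads are more than `thresh` cycles apart, assume the
 * read was disturbed (e.g. by an NMI or SMI) and reject the sample.
 */
bool sample_once(read_counter_fn read_tsc, read_counter_fn read_ref,
                 void *ctx, uint64_t thresh, struct ref_sample *out)
{
    uint64_t t1 = read_tsc(ctx);
    uint64_t ref = read_ref(ctx);
    uint64_t t2 = read_tsc(ctx);

    if (t2 - t1 > thresh)
        return false;

    out->tsc = t2;
    out->ref = ref;
    return true;
}

/* Retry a disturbed sample up to `max_attempts` times before failing. */
bool sample_with_retry(read_counter_fn read_tsc, read_counter_fn read_ref,
                       void *ctx, uint64_t thresh, unsigned max_attempts,
                       struct ref_sample *out)
{
    assert(max_attempts > 0);

    for (unsigned i = 0; i < max_attempts; i++) {
        if (sample_once(read_tsc, read_ref, ctx, thresh, out))
            return true;
    }
    return false;
}

/* Scripted fake clock (made up for illustration): replays fixed TSC values. */
struct fake_clock {
    const uint64_t *tsc_script;
    size_t len;
    size_t pos;
    uint64_t ref_value;
};

uint64_t fake_tsc(void *ctx)
{
    struct fake_clock *fc = ctx;
    uint64_t v = fc->tsc_script[fc->pos];

    /* Advance through the script, then keep returning the last value. */
    if (fc->pos + 1 < fc->len)
        fc->pos++;
    return v;
}

uint64_t fake_ref(void *ctx)
{
    struct fake_clock *fc = ctx;
    return fc->ref_value;
}
```

With a script of TSC values {0, 5000, 10000, 10010} and a threshold of 100
cycles, the first attempt is rejected (the two reads are 5000 cycles apart) and
the second one is accepted, which is the behavior a single-shot read would miss.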

[Test case]
* Unfortunately there is no easy way to test the effectiveness of the
commits, especially the refinement improvement.

* The user who reported the missing refinements to us was able to test 300
reboots with a regular Bionic kernel (and it reproduced the issue at least
once), whereas when they tested with the Bionic kernel plus both of the commits
proposed here, the problem didn't happen.

* Regarding the calibration commit, it was well tested by the community using
multiple machines and comparing the TSC calibration read against the tables
present at instlatx64.atw.hu .
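For reference, the arithmetic behind the calibration commit can be sketched as
follows. This is a simplified user-space illustration, not the kernel code: the
struct and function names are hypothetical, and the CPUID register values are
passed in as plain integers (rather than read with the CPUID instruction) so
the example runs on any machine. The assumption is that CPUID.0x15 provides the
TSC/crystal ratio (EAX = denominator, EBX = numerator) and optionally the
crystal frequency in Hz (ECX), and that CPUID.0x16 EAX provides the base
frequency in MHz; when the crystal frequency is not reported, it is derived
from the base frequency and the ratio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register values as they would come from CPUID leaves 0x15 and 0x16. */
struct cpuid_tsc_info {
    uint32_t denominator;  /* CPUID.0x15 EAX */
    uint32_t numerator;    /* CPUID.0x15 EBX */
    uint32_t crystal_hz;   /* CPUID.0x15 ECX, 0 if not reported */
    uint32_t base_mhz;     /* CPUID.0x16 EAX, 0 if leaf unavailable */
};

/* Crystal frequency in kHz, or 0 if it cannot be determined. */
uint32_t crystal_khz_from_cpuid(const struct cpuid_tsc_info *info)
{
    assert(info != NULL);

    if (info->denominator == 0 || info->numerator == 0)
        return 0;

    /* Prefer the value reported directly by the hardware. */
    if (info->crystal_hz != 0)
        return info->crystal_hz / 1000;

    /* Not reported: derive it from the base frequency and the ratio. */
    return (uint32_t)((uint64_t)info->base_mhz * 1000 *
                      info->denominator / info->numerator);
}

/* TSC frequency in kHz: crystal frequency scaled by the 0x15 ratio. */
uint32_t tsc_khz_from_cpuid(const struct cpuid_tsc_info *info)
{
    uint32_t crystal_khz = crystal_khz_from_cpuid(info);

    if (crystal_khz == 0)
        return 0;

    return (uint32_t)((uint64_t)crystal_khz * info->numerator /
                      info->denominator);
}
```

As an example with made-up inputs chosen for round numbers (denominator 2,
numerator 200, no crystal frequency reported, base frequency 2400 MHz), the
derived crystal frequency is 2400 * 1000 * 2 / 200 = 24000 kHz and the resulting
TSC frequency is 24000 * 200 / 2 = 2400000 kHz.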

[Regression potential]
* We consider the regression potential low, especially due to the nature of the
patches: the first is basically a retry mechanism (plus an adjustment to an
offset to reflect more recent machines), and the 2nd is an improvement for TSC
calibration on some platforms (whose values are currently hardcoded in a table
in the kernel). Also, the patches have been upstream for a while and I
couldn't find any fixes for them.

* A hypothetical regression from the 2nd patch could be in the TSC precision
calculation, which the refinement itself might well work around. For the first
patch, a bug in the code is the only hypothetical regression I could think of.

Daniel Drake (1):
  x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency

Daniel Vacek (1):
  x86/tsc: Make calibration refinement more robust

 arch/x86/kernel/tsc.c | 77 ++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 34 deletions(-)

-- 
2.25.2


