Cmnt: [SRU][J][PATCH 0/2] Fix bugs preventing boot on Intel TDX-enabled hosts

Tue Feb 11 11:42:28 UTC 2025

On Mon, Feb 10, 2025 at 07:13:32PM GMT, Ian Whitfield wrote:
> BugLink: https://bugs.launchpad.net/bugs/2097811
> 
> SRU Justification:
> 
> [Impact]
> 
> Google has requested these upstream commits be applied in order to fix
> bugs preventing the boot of 5.15 kernel instances on their Intel TDX
> enabled infrastructure.
> 
> These patches aim to resolve problems with incorrect assessment of the
> CPU's address width in bits on x86, mostly during boot.
> 
> [Fix]
> 
> The first patch applied cleanly. The second patch had a large number of
> unrelated conflicts resolved by adjusting the context around the changes
> in the patch. One conflict did have a direct impact on the patch, but it
> was resolved by tracing where a function call had been moved, and making
> the original changes there.
> 
> This patchset was originally targeting the jammy:linux-gcp kernel, but
> the same problem exists in the generic kernel. For this reason, a
> separate thread was made for each kernel such that linux-gcp can get the
> patches early and after the generic patch window has already closed, but
> the same patches can be reviewed and applied to generic to fix the same
> bugs, at a later time.
> 
> [Test Plan]
> 
> Google reported inability to boot Focal images (which use a backport of
> this kernel) on a specific configuration in a deployment zone where
> Intel TDX was enabled. This patchset was tested by booting a Jammy image
> on one such machine (which uses the 6.8 kernel), installing this patched
> kernel, and booting into it. Before this patch is applied, the installed
> kernel doesn't finish the boot process, and after the patch is applied,
> it boots as normal.
> 
> [Where problems could occur]
> 
> As these changes affect booting and the kernel's understanding of the
> cpu, an error in the backporting of these patches could cause the user
> to be unable to boot the kernel. Risk of an error is relatively low due
> to the first patch applying cleanly and the second patch only needing
> modification in the MTRR cleanup feature, which could be disabled with
> a kernel command line parameter. If the fixes don't work, we would see
> the kernel continue to not be bootable on TDX-enabled hosts.
> 
> Juergen Gross (1):
>   x86/mtrr: Remove physical address size calculation
> 
> Paolo Bonzini (1):
>   x86/cpu: Allow reducing x86_phys_bits during early_identify_cpu()
> 
>  arch/x86/kernel/cpu/common.c       |  2 +
>  arch/x86/kernel/cpu/mtrr/cleanup.c | 16 ++++----
>  arch/x86/kernel/cpu/mtrr/generic.c | 12 +++++-
>  arch/x86/kernel/cpu/mtrr/mtrr.c    | 61 ++++--------------------------
>  arch/x86/kernel/cpu/mtrr/mtrr.h    |  4 +-
>  5 files changed, 31 insertions(+), 64 deletions(-)
> 

The seemingly clean cherry-pick [1/2] appears to break early_identify_cpu():
the first half of the old two-phase setup remains, and get_cpu_address_sizes()
ends up being invoked twice, once before and once after
setup_force_cpu_cap(X86_FEATURE_CPUID). As an aside, one of the #VC handler
issues also seems to remain, which could surprise someone in the future.
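
For anyone following along, here is a minimal sketch of the values at stake.
It is not the kernel code, and the struct and function names are hypothetical.
It only shows the architectural layout of CPUID leaf 0x80000008, where
EAX[7:0] holds the physical address width and EAX[15:8] the linear (virtual)
address width; these are the values that get_cpu_address_sizes() records as
x86_phys_bits and x86_virt_bits.

```c
#include <stdint.h>

/* Hypothetical container; the kernel keeps these in struct cpuinfo_x86. */
struct addr_sizes {
    uint8_t phys_bits;  /* physical address width in bits */
    uint8_t virt_bits;  /* linear (virtual) address width in bits */
};

/*
 * Decode the EAX value returned by CPUID leaf 0x80000008.
 * Only the bit layout is architectural; the helper itself is illustrative.
 */
static struct addr_sizes decode_addr_sizes(uint32_t eax)
{
    struct addr_sizes s;

    s.phys_bits = eax & 0xff;         /* EAX[7:0]  */
    s.virt_bits = (eax >> 8) & 0xff;  /* EAX[15:8] */
    return s;
}
```

Since these values are later consumers' only view of the address widths,
running the decode twice around other early setup is what makes the ordering
in early_identify_cpu() worth a second look.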

Did you intentionally avoid backporting fbf6449f84bf ("x86/sev-es: Set
x86_virt_bits to the correct value straight away, instead of a two-phase
approach") as a prerequisite?

If so, can you explain the reason somewhere (ML or provenance section)? This
matters especially because you're targeting the generic master kernel, and such
a delicate part of the code (one that has historically been fixed multiple
times) would diverge from every upstream revision. I would backport
fbf6449f84bf as a prerequisite, though.

Thanks.