Cmnt: [SRU][J][PATCH 0/2] Fix bugs preventing boot on Intel TDX-enabled hosts

Tue Feb 11 23:53:22 UTC 2025

On Tue, Feb 11, 2025 at 08:42:28PM +0900, Koichiro Den wrote:
> On Mon, Feb 10, 2025 at 07:13:32PM GMT, Ian Whitfield wrote:
> > BugLink: https://bugs.launchpad.net/bugs/2097811
> > 
> > SRU Justification:
> > 
> > [Impact]
> > 
> > Google has requested these upstream commits be applied in order to fix
> > bugs preventing the boot of 5.15 kernel instances on their Intel TDX
> > enabled infrastructure.
> > 
> > These patches aim to resolve problems with incorrect assessment of the
> > CPU's address width in bits on x86, mostly during boot.
> > 
> > [Fix]
> > 
> > The first patch applied cleanly. The second patch had a large number of
> > unrelated conflicts resolved by adjusting the context around the changes
> > in the patch. One conflict did have a direct impact on the patch, but it
> > was resolved by tracing where a function call had been moved, and making
> > the original changes there.
> > 
> > This patchset was originally targeting the jammy:linux-gcp kernel, but
> > the same problem exists in the generic kernel. For this reason, a
> > separate thread was made for each kernel such that linux-gcp can get the
> > patches early and after the generic patch window has already closed, but
> > the same patches can be reviewed and applied to generic to fix the same
> > bugs, at a later time.
> > 
> > [Test Plan]
> > 
> > Google reported inability to boot Focal images (which use a backport of
> > this kernel) on a specific configuration in a deployment zone where
> > Intel TDX was enabled. This patchset was tested by booting a Jammy image
> > on one such machine (which uses the 6.8 kernel), installing this patched
> > kernel, and booting into it. Before this patch is applied, the installed
> > kernel doesn't finish the boot process, and after the patch is applied,
> > it boots as normal.
> > 
> > [Where problems could occur]
> > 
> > As these changes affect booting and the kernel's understanding of the
> > cpu, an error in the backporting of these patches could cause the user
> > to be unable to boot the kernel. Risk of an error is relatively low due
> > to the first patch applying cleanly and the second patch only needing
> > modification in the MTRR cleanup feature, which could be disabled with
> > a kernel command line parameter. If the fixes don't work, we would see
> > the kernel continue to not be bootable on TDX-enabled hosts.
> > 
> > Juergen Gross (1):
> >   x86/mtrr: Remove physical address size calculation
> > 
> > Paolo Bonzini (1):
> >   x86/cpu: Allow reducing x86_phys_bits during early_identify_cpu()
> > 
> >  arch/x86/kernel/cpu/common.c       |  2 +
> >  arch/x86/kernel/cpu/mtrr/cleanup.c | 16 ++++----
> >  arch/x86/kernel/cpu/mtrr/generic.c | 12 +++++-
> >  arch/x86/kernel/cpu/mtrr/mtrr.c    | 61 ++++--------------------------
> >  arch/x86/kernel/cpu/mtrr/mtrr.h    |  4 +-
> >  5 files changed, 31 insertions(+), 64 deletions(-)
> > 
> 
> The seemingly clean cherry-pick [1/1] appears to mess up the early_identify_cpu().
> The first half of the old two-phased setup remains and also there are double
> get_cpu_address_sizes() invocations before/after
> setup_force_cpu_cap(X86_FEATURE_CPUID). As an aside, one of #VC handler issues
> seems to be remaining as well and it would surprise someone in the future.
> 
> Did you intentionally avoid backporting fbf6449f84bf ("x86/sev-es: Set
> x86_virt_bits to the correct value straight away, instead of a two-phase
> approach") as a prerequisite?
> 
> If so, can you explain the reason for it somewhere (ML or provenance section)
> especially since you're aiming generic master kernel and such delicate part
> of the code (that historically has been fixed multiple times) would diverge
> from any upstream revision. I would backport fbf6449f84bf as prerequisite though.
> 
> Thanks.

Thanks for the review, these criticisms of the patchset are well-founded, I'll NACK
it and return with a v2. I was originally motivated by addressing a customer issue,
and applied what was suggested to me. Since then I've found that only the second
commit (f6b980646b93) is strictly necessary to solve the issue for Google, and
seems to not have dependencies like the first one does. My plan now is to only pull
in that commit and leave this delicate part of the kernel as it is otherwise.

-Ian