ACK: [SRU][N][PATCH 0/8] Ubuntu 24.04 - GPU cannot be installed with DL380a Gen12 (2P, SRF-SP)

Mehmet Basaran mehmet.basaran at canonical.com
Mon Oct 7 12:13:13 UTC 2024


Acked-by: Mehmet Basaran <mehmet.basaran at canonical.com>

Michael Reed <michael.reed at canonical.com> writes:

> From: Michael Reed <Michael.Reed at canonical.com>
>
> BugLink: https://bugs.launchpad.net/bugs/2081079
>
> SRU Justification:
>
> [Impact]
> Description:
> Failed to install GPU with Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P
>
> A random write to the VF BAR0 memory region causes the kernel to report an MCE error.
>
> Version-Release number :
> Ubuntu 24.04
>
> Additional info:
>
> We tracked this issue down on RHEL 9.4; it is caused by the following patches.
>
> cb4a6ccf3583 perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge (v6.8-rc1)
> 388d76175bd9 perf/x86/intel/uncore: Support IIO free-running counters on GNR (v6.8-rc1)
> 632c4bf6d007 perf/x86/intel/uncore: Support Granite Rapids (v6.8-rc1)
> b560e0cd882b perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array (v6.8-rc1)
> cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR (v6.8-rc1)
>
> [Test Plan]
> How reproducible:
> Each time
>
> Steps to reproduce
> - Enable PCI segment, Intel VT-d, and SR-IOV in the BIOS
> - Run a fresh install on a 2P DL380a server with a GPU in slot 17
>
> Expected results
> No MCE errors; the installation completes without problems.
>
> Actual results
> The kernel reports MCE errors.
>
> [Fix]
> Intel gave us a patch set that resolves the issue.
> https://lore.kernel.org/lkml/20240614134631.1092359-1-kan.liang@linux.intel.com/#r
>
> The following patches are required.
>
> f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters (v6.11-rc1)
> 15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
> f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore units (v6.11-rc1)
> b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units (v6.11-rc1)
> 80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units (v6.11-rc1)
> 585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB tree (v6.11-rc1)
> c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
> 0007f3932592 perf/x86/uncore: Save the unit control address of all units (v6.11-rc1)
>
> [Where problems could occur]
>
> [Other Info]
>
> https://code.launchpad.net/~mreed8855/ubuntu/+source/linux/+git/noble/+ref/lp_2081079_dl380a_gen12
>
> Kan Liang (8):
>   perf/x86/uncore: Save the unit control address of all units
>   perf/x86/uncore: Support per PMU cpumask
>   perf/x86/uncore: Retrieve the unit ID from the unit control RB tree
>   perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units
>   perf/x86/uncore: Apply the unit control RB tree to MSR uncore units
>   perf/x86/uncore: Apply the unit control RB tree to PCI uncore units
>   perf/x86/uncore: Cleanup unused unit structure
>   perf/x86/intel/uncore: Support HBM and CXL PMON counters
>
>  arch/x86/events/intel/uncore.c           |  97 ++++---
>  arch/x86/events/intel/uncore.h           |   8 +-
>  arch/x86/events/intel/uncore_discovery.c | 306 +++++++++++++++--------
>  arch/x86/events/intel/uncore_discovery.h |  22 +-
>  arch/x86/events/intel/uncore_snbep.c     | 128 ++++++++--
>  5 files changed, 388 insertions(+), 173 deletions(-)
>
> -- 
> 2.34.1