[Bug 1915811] Re: Empty NUMA topology in machines with high number of CPUs

Victor Tapia 1915811 at bugs.launchpad.net
Fri Mar 12 10:07:49 UTC 2021


#VERIFICATION XENIAL

Using the test case described in the description, where a VM has 128
vcpus assigned, the version in -updates does not list the topology:

$ dpkg -l | grep libvirt
ii  libvirt-bin                            1.3.1-1ubuntu10.30                              amd64        programs for the libvirt library
ii  libvirt0:amd64                         1.3.1-1ubuntu10.30                              amd64        library for interfacing with different virtualization systems

$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
XPath set is empty

The package in -proposed fixes the issue (output shortened):

$ dpkg -l | grep libvirt
ii  libvirt-bin                            1.3.1-1ubuntu10.31                              amd64        programs for the libvirt library
ii  libvirt0:amd64                         1.3.1-1ubuntu10.31                              amd64        library for interfacing with different virtualization systems

$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
<topology>
      <cells num="1">
        <cell id="0">
          <memory unit="KiB">4998464</memory>
          <pages unit="KiB" size="4">1249616</pages>
          <pages unit="KiB" size="2048">0</pages>
          <pages unit="KiB" size="1048576">0</pages>
          <distances>
            <sibling id="0" value="10"/>
          </distances>
          <cpus num="128">
            <cpu id="0" socket_id="0" core_id="0" siblings="0"/>
            ...
            <cpu id="127" socket_id="127" core_id="0" siblings="127"/>
          </cpus>
        </cell>
      </cells>
    </topology>

NOTE: if the machine is running a 4.4 kernel, numa_all_cpus_ptr->size
(used to set max_n_cpus in libvirt) is 512 instead of 128 and the issue
cannot be triggered, because libvirt caps vCPUs at 255. Any newer
kernel, such as the HWE kernel, sets the value to 128, triggering the
issue.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1915811

Title:
  Empty NUMA topology in machines with high number of CPUs

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Committed
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in libvirt package in Ubuntu:
  Fix Released
Status in libvirt source package in Xenial:
  Fix Committed
Status in libvirt source package in Bionic:
  Fix Committed
Status in libvirt source package in Focal:
  Fix Committed
Status in libvirt source package in Groovy:
  Fix Committed

Bug description:
  [impact]

  libvirt fails to populate its NUMA topology when the machine has a
  large number of CPUs assigned to a single node. When the CPU count
  completely fills the node's bitmask (all bits set), it hits a
  workaround introduced to build the NUMA topology on machines with
  non-contiguous node ids, and the node is discarded. This has already
  been fixed upstream in the commits listed below.

  [scope]

  The fix is needed for Xenial, Bionic, Focal and Groovy.

  It's fixed upstream with commits 24d7d85208 and 551fb778f5 which are
  included in v6.8, so both are already in hirsute.

  [test case]

  On a machine such as the AMD EPYC 7702P with its NUMA config set to
  NPS1 (a single node per processor), or simply on a VM with 128 vCPUs,
  "virsh capabilities" does not show the NUMA topology:

  # virsh capabilities | xmllint --xpath '/capabilities/host/topology' -

  <topology>
        <cells num="0">
        </cells>
      </topology>

  Whereas it should show (edited for brevity):

  <topology>
        <cells num="1">
          <cell id="0">
            <memory unit="KiB">5027820</memory>
            <pages unit="KiB" size="4">1256955</pages>
            <pages unit="KiB" size="2048">0</pages>
            <distances>
              <sibling id="0" value="10"/>
            </distances>
            <cpus num="128">
              <cpu id="0" socket_id="0" core_id="0" siblings="0"/>
              ....
              <cpu id="127" socket_id="127" core_id="0" siblings="127"/>
            </cpus>
          </cell>
        </cells>
      </topology>

  
  [Where problems could occur]

  Any regression would likely involve an incorrectly built NUMA
  topology, in particular on machines with non-contiguous node ids.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1915811/+subscriptions



More information about the Ubuntu-openstack-bugs mailing list