[Bug 1818239] Re: scheduler: build failure high negative weighting

Eric Miller 1818239 at bugs.launchpad.net
Mon May 6 05:53:18 UTC 2024


I'm a tad late (5+ years after the last comment), but in case it
helps, we just ran into a situation where a user (not a cloud admin)
exhausted the IP allocations in a user-defined subnet, which produced
this error in nova-compute.log:

NoMoreFixedIps: No fixed IP addresses available for network:
62c7b2e2-4406-4dda-9a01-f0ec623335d4

which in turn caused the BuildFailureWeigher to offset the host's
weight by -1000000.

This was in a Stein cluster, so quite old.  A newer version may have
this disabled by default, but we had to disable it manually (a
nova.conf example is shown below).  Regardless, it caused all other
tenants' VM deployments to fail: every compute node that tried to
deploy the above VM hit the same error, so each compute node ended up
with a large negative weight.
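
For anyone who wants to do the same, here is a minimal sketch of the
setting we changed, assuming a release (Queens or later) that exposes
the option and that your schedulers read /etc/nova/nova.conf:

  [filter_scheduler]
  # Default is 1000000.0; 0.0 makes recent build failures irrelevant
  # to host weighting.  Restart nova-scheduler after changing it.
  build_failure_weight_multiplier = 0.0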

Just wanted to mention our experience in case it helps someone else
who is searching for this problem.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1818239

Title:
  scheduler: build failure high negative weighting

Status in OpenStack Nova Cloud Controller Charm:
  Fix Released
Status in OpenStack Compute (nova):
  Incomplete
Status in OpenStack Security Advisory:
  Won't Fix
Status in nova package in Ubuntu:
  Triaged

Bug description:
  Whilst debugging a Queens cloud which seemed to be landing all new
  instances on 3 out of 9 hypervisors (resulting in three very heavily
  overloaded servers), I noticed that the weighting applied by the build
  failure weigher is -1000000.0 * the number of failures:

  https://github.com/openstack/nova/blob/master/nova/conf/scheduler.py#L495

  This means that a server that hits any sort of build failure instantly
  drops to the bottom of the weighed list of hypervisors for instance
  scheduling.
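
  To make the effect concrete, here is a small, self-contained Python
  sketch (not the actual nova code; the names and numbers are
  illustrative) of how an offset of -1000000.0 per failure swamps any
  normal weight difference between hosts:

    # Illustrative only: effective weight = base weight minus
    # multiplier * failed_builds, so a single failure outweighs any
    # realistic difference in load between hosts.
    MULTIPLIER = 1000000.0

    def effective_weight(base_weight, failed_builds, multiplier=MULTIPLIER):
        return base_weight - multiplier * failed_builds

    hosts = {
        "overloaded-host": (0.1, 0),  # heavily loaded, no build failures
        "idle-host": (0.9, 1),        # nearly idle, one failed build
    }
    ranked = sorted(hosts, key=lambda h: effective_weight(*hosts[h]),
                    reverse=True)
    print(ranked)  # ['overloaded-host', 'idle-host']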

  Why might an instance fail to build? It could be a timeout due to
  load, or it might be a bad image (one that won't actually boot under
  qemu).  This second cause can be triggered by an end user of the
  cloud, inadvertently pushing all new instances onto a small subset of
  hypervisors (which is what I think happened in our case).

  This feels like quite a dangerous default to have, given the
  potential to DoS hypervisors, intentionally or otherwise.

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: nova-scheduler 2:17.0.7-0ubuntu1
  ProcVersionSignature: Ubuntu 4.15.0-43.46-generic 4.15.18
  Uname: Linux 4.15.0-43-generic x86_64
  ApportVersion: 2.20.9-0ubuntu7.5
  Architecture: amd64
  Date: Fri Mar  1 13:57:39 2019
  NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
  PackageArchitecture: all
  ProcEnviron:
   TERM=screen-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=C.UTF-8
   SHELL=/bin/bash
  SourcePackage: nova
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1818239/+subscriptions
