[Bug 1818239] Re: scheduler: build failure high negative weighting

Wed Mar 6 18:25:59 UTC 2019

With the weigher, you shouldn't be able to "take down" anything. You may
stack a lot more instances on the non-error-reporting hosts, but once
those are full, the scheduler will try one of the hosts reporting
errors, and as soon as one succeeds there, the score resets to zero. So
can you clarify "took down" in this context?
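
To make that sequence concrete, here is a rough sketch in Python. It is
a toy model, not nova's code: the Host class, the slot counts and the
helper names are all made up for illustration, and the multiplier is
just the figure quoted in the bug description.

```python
from dataclasses import dataclass

# Assumption: mirrors the "-1000000.0 * number of failures" figure from
# the bug description; not read from any real configuration.
FAILURE_MULTIPLIER = 1_000_000.0


@dataclass
class Host:
    """Hypothetical stand-in for a compute host's scheduler state."""
    name: str
    free_slots: int
    failed_builds: int = 0

    def weight(self) -> float:
        # More free slots is better; each recorded failure is a big penalty.
        return self.free_slots - FAILURE_MULTIPLIER * self.failed_builds


def schedule(hosts):
    """Pick the best-weighted host that still has room, or None."""
    candidates = [h for h in hosts if h.free_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.weight())


def build(host, succeeds=True):
    """Record the outcome of one build attempt on the chosen host."""
    if succeeds:
        host.free_slots -= 1
        host.failed_builds = 0  # a success resets the failure score
    else:
        host.failed_builds += 1


hosts = [
    Host("good", free_slots=2),
    Host("flaky", free_slots=5, failed_builds=3),
]
placements = []
for _ in range(4):
    chosen = schedule(hosts)
    build(chosen)
    placements.append(chosen.name)
```

In this model the first two builds stack on "good"; once it is full the
third build lands on "flaky", succeeds, and clears its failure count, so
the penalized host is deprioritized rather than removed.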

Also, the weight given to this weigher, like all others, is
configurable. If you have no desire to deprioritize failing hosts, you
can set it to zero, and if you want this to have a smaller impact then
you can change the weight to something smaller. The default weight was
carefully chosen to cause a failing host to have a lower weight than
others, all else being equal. Since the disk weigher scales by free
bytes (or whatever), if you're a new compute node that has no instances
(and thus a lot of free space) and a bad config that will cause you to
fail every boot, the fail weigher has to have an impactful score, else
it really will have no effect.
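
As a rough illustration of why a small multiplier does nothing, here is
a toy calculation in Python. It assumes the simplified picture above (a
disk score that grows with free space, minus a per-failure penalty) and
does not reproduce how nova actually normalizes and combines weighers.
The host names, the figures and the function names are invented.

```python
def total_weight(free_disk_gb, failed_builds, fail_multiplier):
    """Toy score: raw free disk minus a per-failure penalty."""
    return free_disk_gb - fail_multiplier * failed_builds


def pick(hosts, fail_multiplier):
    """Return the name of the host with the highest toy score."""
    return max(
        hosts,
        key=lambda name: total_weight(*hosts[name], fail_multiplier),
    )


# name -> (free_disk_gb, failed_builds); figures are made up.
hosts = {
    "new-broken": (2000, 3),   # empty node, bad config, every boot fails
    "old-healthy": (500, 0),   # busy node that builds fine
}

# Try a disabled weigher, a small multiplier, and a large one.
winners = {m: pick(hosts, m) for m in (0.0, 1.0, 1_000_000.0)}
```

With these numbers the broken node still wins at a multiplier of 0.0
and at 1.0, because its free space swamps the penalty. Only the large
multiplier pushes it below the healthy node.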

I've nearly lost the will to even argue about this issue, so I'm not
sure what my opinion is on setting the default to zero, other than to
say that the converse argument is also true... If you have one compute
node with a broken config (or even just something preventing it from
talking to neutron), it will attract all builds in the scheduler, fail
them, and the cloud is effectively down until a human is paged to remedy
the situation. That is the case this weigher was meant to mitigate, in
both its original and evolved forms.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1818239

Title:
  scheduler: build failure high negative weighting

Status in OpenStack nova-cloud-controller charm:
  Fix Committed
Status in OpenStack Compute (nova):
  Incomplete
Status in nova package in Ubuntu:
  Triaged

Bug description:
  Whilst debugging a Queens cloud which seems to be landing all new
  instances on 3 out of 9 hypervisors (which resulted in three very
  heavily overloaded servers) I noticed that the weighting of the build
  failure weigher is -1000000.0 * number of failures:

  https://github.com/openstack/nova/blob/master/nova/conf/scheduler.py#L495

  This means that a server which has any sort of build failure instantly
  drops to the bottom of the weighed list of hypervisors for scheduling
  of instances.

  Why might an instance fail to build? Could be a timeout due to load,
  might also be due to a bad image (one that won't actually boot under
  qemu).  This second cause could be triggered by an end user of the
  cloud inadvertently causing all instances to be pushed to a small
  subset of hypervisors (which is what I think happened in our case).

  This feels like quite a dangerous default to have given the potential
  to DoS hypervisors intentionally or otherwise.

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: nova-scheduler 2:17.0.7-0ubuntu1
  ProcVersionSignature: Ubuntu 4.15.0-43.46-generic 4.15.18
  Uname: Linux 4.15.0-43-generic x86_64
  ApportVersion: 2.20.9-0ubuntu7.5
  Architecture: amd64
  Date: Fri Mar  1 13:57:39 2019
  NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
  PackageArchitecture: all
  ProcEnviron:
   TERM=screen-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=C.UTF-8
   SHELL=/bin/bash
  SourcePackage: nova
  UpgradeStatus: No upgrade log present (probably fresh install)
