[Bug 1818239] Re: scheduler: build failure high negative weighting
Matt Riedemann
mriedem.os at gmail.com
Tue Mar 5 22:35:11 UTC 2019
I've marked this as Incomplete for nova since I'm not aware of any
changes being requested of nova here. The build failure weigher came
out of bug 1742102, and the underlying behaviour came from operator
feedback at the Boston summit asking for computes to be auto-disabled
if they experienced a build failure. That auto-disable behaviour went
in (in Pike, I believe), and then bug 1742102 pointed out that it was
too heavy-handed, since there were easy ways to auto-disable a lot of
computes in a deployment (e.g. volume over-quota from multiple
concurrent boot-from-volume server create requests). So Dan added the
weigher, which can be configured to weigh hosts with build failures
lower so they don't become a black hole, and once a host has a
successful build the failure stats for that compute node are reset.
Anyway, it's up to the charm whether it wants to disable this by
default, but note that it was added for a reason and defaults to being
"on" for a reason as well (per operator feedback).
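For reference, the knob the weigher reads is in nova.conf; a minimal
sketch of how an operator (or the charm) could soften or effectively
disable the behaviour might look like this (the value shown is
illustrative, not a recommendation):

    [filter_scheduler]
    # Default is 1000000.0; smaller values make a build failure less
    # punishing, and 0.0 removes the weigher's influence entirely.
    build_failure_weight_multiplier = 0.0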
** Changed in: nova
Status: New => Incomplete
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1818239
Title:
scheduler: build failure high negative weighting
Status in OpenStack nova-cloud-controller charm:
Fix Committed
Status in OpenStack Compute (nova):
Incomplete
Status in nova package in Ubuntu:
Triaged
Bug description:
Whilst debugging a Queens cloud which seemed to be landing all new
instances on 3 out of 9 hypervisors (resulting in three very heavily
overloaded servers), I noticed that the weighting applied by the build
failure weigher is -1000000.0 * number of failures:

https://github.com/openstack/nova/blob/master/nova/conf/scheduler.py#L495

This means that a host which has had any sort of build failure
instantly drops to the bottom of the weighed list of hypervisors for
scheduling of instances.
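To make the scale of that multiplier concrete, here is a small
self-contained Python sketch. It is not nova's actual weigher code and
it ignores nova's per-weigher normalisation, but it shows how a single
recorded build failure swamps an otherwise favourable RAM weight:

    # Simplified illustration only; real nova weighers normalise each
    # metric across hosts before applying the multiplier, but the
    # imbalance is similar.
    BUILD_FAILURE_MULTIPLIER = 1000000.0   # nova's default
    RAM_WEIGHT_MULTIPLIER = 1.0            # nova's default

    def host_weight(free_ram_mb, failed_builds):
        # Higher weight wins; each failed build adds a huge negative term.
        return (RAM_WEIGHT_MULTIPLIER * free_ram_mb
                - BUILD_FAILURE_MULTIPLIER * failed_builds)

    hosts = {
        'hv1': host_weight(free_ram_mb=256000, failed_builds=0),
        'hv2': host_weight(free_ram_mb=8000, failed_builds=0),
        'hv3': host_weight(free_ram_mb=512000, failed_builds=1),
    }

    # hv3 has by far the most free RAM, but its single build failure
    # drops it to the bottom of the list:
    print(sorted(hosts, key=hosts.get, reverse=True))   # ['hv1', 'hv2', 'hv3']

With the default multiplier, a host that is otherwise the best
candidate still sorts below every host with zero failures, which is
the behaviour described above.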
Why might an instance fail to build? It could be a timeout due to
load, but it might also be due to a bad image (one that won't actually
boot under qemu). That second cause could be triggered by an end user
of the cloud, inadvertently pushing all new instances onto a small
subset of hypervisors (which is what I think happened in our case).

This feels like quite a dangerous default to have, given the potential
to DOS hypervisors intentionally or otherwise.
ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: nova-scheduler 2:17.0.7-0ubuntu1
ProcVersionSignature: Ubuntu 4.15.0-43.46-generic 4.15.18
Uname: Linux 4.15.0-43-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.5
Architecture: amd64
Date: Fri Mar 1 13:57:39 2019
NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
PackageArchitecture: all
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
SourcePackage: nova
UpgradeStatus: No upgrade log present (probably fresh install)
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1818239/+subscriptions