[Bug 1763442] Re: instance creation fails with "Failed to allocate the network(s), not rescheduling." because neutron-ovs-agent rpc_loop took too long
Skia
1763442 at bugs.launchpad.net
Thu Jun 20 10:05:19 UTC 2024
Is this bug still relevant nowadays? autopkgtest-cloud has some workaround for this, as asked in bug #1763445.
Is this workaround something we can get rid of?
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1763442
Title:
instance creation fails with "Failed to allocate the network(s), not
rescheduling." because neutron-ovs-agent rpc_loop took too long
Status in Ubuntu Cloud Archive:
New
Status in neutron package in Ubuntu:
Confirmed
Bug description:
Hi,
I'm running a cloud with 18.02 charms, and origin cloud:xenial-ocata.
We use openvswitch and GRE tunnels for networking. We use l2pop.
Today I investigated an instance creation failure. The "fault" message
in "nova show" was the following :
Build of instance 59ea33ee-b167-48e9-ba61-e041aa3c0059 aborted: Failed
to allocate the network(s), not rescheduling.", "code": 500,
"details": " File \"/usr/lib/python2.7/dist-
packages/nova/compute/manager.py\", line 1788, in
_do_build_and_run_instance
The nova-compute log revealed the following:
2018-04-12 05:58:18.639 1852 ERROR nova.compute.manager [instance:
59ea33ee-b167-48e9-ba61-e041aa3c0059] VirtualInterfaceCreateException:
Virtual Interface creation failed
because the vif-plugged event never came in (below is the "preparing
for event" message) :
2018-04-12 05:53:14.539 1852 DEBUG nova.compute.manager
[req-9fd7bef1-8c17-4659-afdd-ac490a7d548e
eacfab4914674b3d9225255198b10fcd 169e085b0bc7408490e8165dc1d1b19c - -
-] [instance: 59ea33ee-b167-48e9-ba61-e041aa3c0059] Preparing to wait
for external event network-vif-
plugged-16389f31-d9b0-4fba-94cc-a28c1f504b19
prepare_for_instance_event /usr/lib/python2.7/dist-
packages/nova/compute/manager.py:328
I was a bit surprised because everything appeared to have worked well.
Looking at the neutron-ovs-agent logs, I found the reason the event
never got sent: the rpc_loop that was running when the instance was
created lasted more than 5 minutes!
Beginning of the rpc_loop:
2018-04-12 05:52:59.287 3796 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Agent rpc_loop - iteration:549380 started rpc_loop /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1951
End of the rpc_loop:
2018-04-12 05:58:37.857 3796 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Agent rpc_loop - iteration:549380 completed. Processed ports statistics: {'regular': {'updated': 7, 'added': 7, 'removed': 7}}. Elapsed:338.570 loop_count_and_wait /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771
As you can see, it lasted 338.570 seconds. Most of this time was spent running "conntrack -D" commands, such as:
2018-04-12 05:55:44.984 3796 DEBUG neutron.agent.linux.utils [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'conntrack', '-D', '-f', 'ipv4', '-d', '10.X.Y.Z', '-w', '10014', '-s', '10.A.B.C'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:97
There were 2 batches of those: a first one with 176 commands, and a
second one with 194 commands. With each command taking ~0.90 seconds
(the machines are pretty heavily loaded; it's the call to neutron-
rootwrap that takes most of the time, not the conntrack deletion
itself), it does add up.
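A quick back-of-the-envelope check (assuming the ~0.90 s per command
figure above, and nova's default vif_plugging_timeout of 300 s, which
is an assumption about this deployment's configuration) shows how the
two batches alone blow past the timeout:

# Rough sanity check of the timing reported above; the 0.90 s and
# 300 s figures are assumptions, not measured constants.
batch_sizes = [176, 194]        # conntrack -D commands per batch
seconds_per_command = 0.90      # mostly neutron-rootwrap overhead

conntrack_time = sum(batch_sizes) * seconds_per_command
print("time spent in conntrack -D: ~%.0f s" % conntrack_time)  # ~333 s
print("observed rpc_loop duration: 338.570 s")
print("exceeds default vif_plugging_timeout (300 s): %s"
      % (conntrack_time > 300))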
These conntrack commands are created from /usr/lib/python2.7/dist-
packages/neutron/agent/linux/iptables_firewall.py, in
update_security_group_members(), which then calls
_update_ipset_members(), which does the following:
    for ip_version, current_ips in sg_members.items():
        add_ips, del_ips = self.ipset.set_members(
            sg_id, ip_version, current_ips)
        if devices and del_ips:
            # remove prefix from del_ips
            ips = [str(netaddr.IPNetwork(del_ip).ip) for del_ip in del_ips]
            self.ipconntrack.delete_conntrack_state_by_remote_ips(
                devices, ip_version, ips)
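To make the fan-out explicit: each removed member IP results in one
"conntrack -D" per affected local port IP (matching the rootwrap
command logged above), so a secgroup with a few dozen members quickly
turns into hundreds of invocations. A rough sketch of that pattern
(this is an illustration only, not neutron's actual code; the function
name and parameters are hypothetical):

# Simplified illustration -- not neutron's implementation.
# Assumption: roughly one 'conntrack -D' per (local port IP, removed
# remote IP) pair, each spawned through neutron-rootwrap, which is
# what dominates the cost on a loaded host.
import subprocess

def delete_conntrack_for_removed_ips(port_ips, zone, ip_version, removed_ips):
    family = 'ipv4' if ip_version == 4 else 'ipv6'
    for local_ip in port_ips:
        for remote_ip in removed_ips:
            cmd = ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf',
                   'conntrack', '-D', '-f', family, '-d', local_ip,
                   '-w', str(zone), '-s', remote_ip]
            # Each invocation pays the full rootwrap start-up cost
            # (~0.9 s here), and nothing bounds the total loop time.
            subprocess.call(cmd)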
I believe it's trying to remove any existing conntrack entry when an
IP is removed from a secgroup.
And it looks like there's no safeguard in place to make sure that this
runs in a bounded amount of time, or is split across multiple rpc_loop
iterations.
So nova-compute times out waiting for the vif-plugged-event, and
instance creation fails.
It should be easy to reproduce if you create a bunch of instances in
the same secgroup (we were at 28), delete one, and immediately create
one (on the same compute node, I guess?). You can also add a small
sleep of 0.5 s to neutron-rootwrap to simulate the load, as sketched
below.
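One way to add that artificial delay (an assumption about the setup,
not something from the original report: move the real binary aside,
e.g. to /usr/bin/neutron-rootwrap.real, and drop this wrapper in its
place):

#!/usr/bin/env python
# Hypothetical wrapper to simulate a heavily loaded host: sleep a bit,
# then hand off to the real rootwrap binary. Assumes the original
# /usr/bin/neutron-rootwrap was renamed to neutron-rootwrap.real.
import os
import sys
import time

time.sleep(0.5)  # simulated per-invocation overhead
real = '/usr/bin/neutron-rootwrap.real'
os.execv(real, [real] + sys.argv[1:])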
You can find the full log of the rpc_loop iteration at
https://pastebin.canonical.com/p/hqKb3Rg72s/ (sorry, Canonical-only
link)
Thanks!
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1763442/+subscriptions
More information about the Ubuntu-openstack-bugs mailing list