[Bug 1763442] Re: instance creation fails with "Failed to allocate the network(s), not rescheduling." because neutron-ovs-agent rpc_loop took too long
Skia
1763442 at bugs.launchpad.net
Thu Jun 20 10:05:19 UTC 2024
Is this bug still relevant nowadays? autopkgtest-cloud has some workaround for this, as asked in bug #1763445.
Is this workaround something we can get rid of?
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1763442
Title:
instance creation fails with "Failed to allocate the network(s), not
rescheduling." because neutron-ovs-agent rpc_loop took too long
Status in Ubuntu Cloud Archive:
New
Status in neutron package in Ubuntu:
Confirmed
Bug description:
Hi,
I'm running a cloud with 18.02 charms, and origin cloud:xenial-ocata.
We use openvswitch and GRE tunnels for networking. We use l2pop.
Today I investigated an instance creation failure. The "fault" message
in "nova show" was the following :
Build of instance 59ea33ee-b167-48e9-ba61-e041aa3c0059 aborted: Failed
to allocate the network(s), not rescheduling.", "code": 500,
"details": " File \"/usr/lib/python2.7/dist-
packages/nova/compute/manager.py\", line 1788, in
_do_build_and_run_instance
The nova-compute log revealed the following:
2018-04-12 05:58:18.639 1852 ERROR nova.compute.manager [instance:
59ea33ee-b167-48e9-ba61-e041aa3c0059] VirtualInterfaceCreateException:
Virtual Interface creation failed
because the vif-plugged event never came in (below is the "preparing
for event" message) :
2018-04-12 05:53:14.539 1852 DEBUG nova.compute.manager
[req-9fd7bef1-8c17-4659-afdd-ac490a7d548e
eacfab4914674b3d9225255198b10fcd 169e085b0bc7408490e8165dc1d1b19c - -
-] [instance: 59ea33ee-b167-48e9-ba61-e041aa3c0059] Preparing to wait
for external event network-vif-
plugged-16389f31-d9b0-4fba-94cc-a28c1f504b19
prepare_for_instance_event /usr/lib/python2.7/dist-
packages/nova/compute/manager.py:328
I was a bit surprised because everything appeared to have worked well.
Looking at the neutron-ovs-agent logs, I found the reason the event
never got sent: the rpc_loop that was running when the instance was
created lasted more than 5 minutes!
Beginning of the rpc_loop:
2018-04-12 05:52:59.287 3796 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Agent rpc_loop - iteration:549380 started rpc_loop /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1951
End of the rpc_loop:
2018-04-12 05:58:37.857 3796 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Agent rpc_loop - iteration:549380 completed. Processed ports statistics: {'regular': {'updated': 7, 'added': 7, 'removed': 7}}. Elapsed:338.570 loop_count_and_wait /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771
As you can see, it lasted 338.570 seconds. Most of this time was spent running "conntrack -D" commands, such as:
2018-04-12 05:55:44.984 3796 DEBUG neutron.agent.linux.utils [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'conntrack', '-D', '-f', 'ipv4', '-d', '10.X.Y.Z', '-w', '10014', '-s', '10.A.B.C'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:97
There were 2 batches of those: a first one with 176 commands, and a
second one with 194 commands. With each command taking ~0.90 seconds
(the machines are pretty heavily loaded; it's the call to neutron-
rootwrap that takes most of the time, not the conntrack deletion
itself), it does add up.
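A quick back-of-the-envelope check (assuming the ~0.90 s per command
figure above, and nova's default vif_plugging_timeout of 300 s, which
is an assumption about this deployment's configuration) shows how the
two batches alone blow past the timeout:

# Rough sanity check of the timing reported above; the 0.90 s and
# 300 s figures are assumptions, not measured constants.
batch_sizes = [176, 194]        # conntrack -D commands per batch
seconds_per_command = 0.90      # mostly neutron-rootwrap overhead

conntrack_time = sum(batch_sizes) * seconds_per_command
print("time spent in conntrack -D: ~%.0f s" % conntrack_time)  # ~333 s
print("observed rpc_loop duration: 338.570 s")
print("exceeds default vif_plugging_timeout (300 s): %s"
      % (conntrack_time > 300))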
These conntrack commands are created from /usr/lib/python2.7/dist-
packages/neutron/agent/linux/iptables_firewall.py, in
update_security_group_members(), which then calls
_update_ipset_members(), which does the following:
    for ip_version, current_ips in sg_members.items():
        add_ips, del_ips = self.ipset.set_members(
            sg_id, ip_version, current_ips)
        if devices and del_ips:
            # remove prefix from del_ips
            ips = [str(netaddr.IPNetwork(del_ip).ip) for del_ip in del_ips]
            self.ipconntrack.delete_conntrack_state_by_remote_ips(
                devices, ip_version, ips)
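To make the fan-out explicit: each removed member IP results in one
"conntrack -D" per affected local port IP (matching the rootwrap
command logged above), so a secgroup with a few dozen members quickly
turns into hundreds of invocations. A rough sketch of that pattern
(this is an illustration only, not neutron's actual code; the function
name and parameters are hypothetical):

# Simplified illustration -- not neutron's implementation.
# Assumption: roughly one 'conntrack -D' per (local port IP, removed
# remote IP) pair, each spawned through neutron-rootwrap, which is
# what dominates the cost on a loaded host.
import subprocess

def delete_conntrack_for_removed_ips(port_ips, zone, ip_version, removed_ips):
    family = 'ipv4' if ip_version == 4 else 'ipv6'
    for local_ip in port_ips:
        for remote_ip in removed_ips:
            cmd = ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf',
                   'conntrack', '-D', '-f', family, '-d', local_ip,
                   '-w', str(zone), '-s', remote_ip]
            # Each invocation pays the full rootwrap start-up cost
            # (~0.9 s here), and nothing bounds the total loop time.
            subprocess.call(cmd)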
I believe it's trying to remove any existing conntrack entry when an
IP is removed from a secgroup.
And it looks like there's no safeguard in place to make sure that this
runs in a bounded amount of time, or is split across multiple rpc_loop
iterations.
So nova-compute times out waiting for the vif-plugged-event, and
instance creation fails.
It should be easy to reproduce if you create a bunch of instances in
the same secgroup (we were at 28), delete one, and immediately create
one (on the same compute node, I guess?). You can also add a small
sleep of 0.5 s to neutron-rootwrap to simulate the load, as sketched
below.
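One way to add that artificial delay (an assumption about the setup,
not something from the original report: move the real binary aside,
e.g. to /usr/bin/neutron-rootwrap.real, and drop this wrapper in its
place):

#!/usr/bin/env python
# Hypothetical wrapper to simulate a heavily loaded host: sleep a bit,
# then hand off to the real rootwrap binary. Assumes the original
# /usr/bin/neutron-rootwrap was renamed to neutron-rootwrap.real.
import os
import sys
import time

time.sleep(0.5)  # simulated per-invocation overhead
real = '/usr/bin/neutron-rootwrap.real'
os.execv(real, [real] + sys.argv[1:])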
You can find the full log of the rpc_loop iteration at
https://pastebin.canonical.com/p/hqKb3Rg72s/ (sorry, Canonical-only
link)
Thanks!
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1763442/+subscriptions
More information about the Ubuntu-openstack-bugs mailing list