[Bug 1734227] Re: frequent spawn failures on busy ocata cloud

Corey Bryant corey.bryant at canonical.com
Thu Jan 25 19:31:40 UTC 2018


Hi Paul,

Thanks for reporting this.

Based on the port binding attempt occurring ~5 minutes later, it seems
you're on to something with the theory that neutron-openvswitch-agent
somehow got stuck for long enough that nova-compute timed out. If you
see this again, can you share more details about system load on the
nova/neutron hosts? /var/log/syslog may be helpful as well.

Adding upstream projects to this as well in case they've seen similar
issues.

Thanks,
Corey

** Also affects: neutron
   Importance: Undecided
       Status: New

** Also affects: nova
   Importance: Undecided
       Status: New

** No longer affects: nova

** Also affects: nova
   Importance: Undecided
       Status: New

** Changed in: cloud-archive
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1734227

Title:
  frequent spawn failures on busy ocata cloud

Status in Ubuntu Cloud Archive:
  Incomplete
Status in Ubuntu Cloud Archive ocata series:
  New
Status in neutron:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  On one of our OpenStack deployments, which runs ocata from the
  Ubuntu Cloud Archive on Ubuntu 16.04 LTS, we often find that
  instances fail to spawn as follows.  We generally observe this
  problem more frequently on older compute hosts when the spawn rate
  is high enough (this cloud runs mostly short-lived instances).

  From nova show's "fault" field:

   {"message": "Build of instance 1d005a25-8168-4dc5-807f-fee261244fce aborted: Failed to allocate the network(s), not rescheduling.", "code": 5
  2.7/dist-packages/nova/compute/manager.py\", line 1781, in _do_build_and_run_instance
       filter_properties)
     File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1991, in _build_and_run_instance
       reason=msg)
   ", "created": "2017-11-23T00:55:09Z"}

  
  Reviewing nova-compute.log reveals the following:

  2017-11-23 00:50:01.027 4173 DEBUG nova.compute.manager [req-c8ca7a40-7960-4d64-99e6-80819c8d2c4a 9de5c125a94e46d696be0d4878570867 c23bfbac9c1045c5bbc632d950a5bdd1 - - -] [instance: 1d005a25-8168-4dc5-807f-fee261244fce] Starting instance... _do_build_and_run_instance /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1752
  [...]
  2017-11-23 00:55:05.978 4173 WARNING nova.virt.libvirt.driver [req-c8ca7a40-7960-4d64-99e6-80819c8d2c4a 9de5c125a94e46d696be0d4878570867 c23bfbac9c1045c5bbc632d950a5bdd1 - - -] [instance: 1d005a25-8168-4dc5-807f-fee261244fce] Timeout waiting for vif plugging callback for instance 1d005a25-8168-4dc5-807f-fee261244fce
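
  The roughly five-minute gap between those two lines matches nova's
  default vif plugging timeout, which presumably explains the timing.
  For reference, the relevant nova.conf options are shown below with
  their default values (assuming we are running the stock defaults,
  which the timing suggests):

      [DEFAULT]
      # Seconds nova-compute waits for neutron's network-vif-plugged
      # event before giving up (default 300, i.e. the ~5 minutes above).
      vif_plugging_timeout = 300
      # Whether a missing event aborts the boot (default True).
      vif_plugging_is_fatal = True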

  
  Reviewing the neutron-server logs, I see e.g.:

  2017-11-23 00:55:08.951 1755 DEBUG neutron.notifiers.nova [-] Sending
  events: [{'tag': u'240480bb-8cbf-4055-94af-28beff621f21', 'name':
  'network-vif-deleted', 'server_uuid': u'1d005a25-8168-4dc5-807f-
  fee261244fce'}] send_events /usr/lib/python2.7/dist-
  packages/neutron/notifiers/nova.py:252

  but there is no such line for network-vif-plugged, which probably
  confirms that nova never received the network-vif-plugged event
  because neutron did not send it in the first place; if neutron had
  even tried to send the event, it would have logged it.
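
  A quick way to double-check that is to scan the neutron-server log
  for any network-vif-plugged notification mentioning the port.  A
  rough sketch (the log path is an assumption; adjust as needed):

      # Did neutron-server ever log a network-vif-plugged notification
      # for this port?  Zero hits would support the theory above.
      PORT_ID = '240480bb-8cbf-4055-94af-28beff621f21'

      with open('/var/log/neutron/neutron-server.log') as log:
          hits = [line for line in log
                  if 'network-vif-plugged' in line and PORT_ID in line]

      print('%d network-vif-plugged lines mention port %s'
            % (len(hits), PORT_ID))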

  Now for neutron-openvswitch-agent.log on the compute host, where the
  last line below looks rather strange, or at least it does if you've
  had a long enough look at OVSNeutronAgent._bind_devices:

  2017-11-23 00:50:03.903 2800 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-8eb54e38-9f47-4f9e-88bd-ba7be5eb9750 b8cc253a896241748f88b380ca155afc 04f691318d40461d98fc1514c839ee67 - - -] port_update message processed for port 240480bb-8cbf-4055-94af-28beff621f21 port_update /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:402
  2017-11-23 00:50:04.451 2800 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["ab0e6871-f9e3-4633-95a2-3f61f9e4ea7e","insert","qvo240480bb-8c",["set",[]],["map",[["attached-mac","fa:16:3e:fb:94:63"],["iface-id","240480bb-8cbf-4055-94af-28beff621f21"],["iface-status","active"],["vm-uuid","1d005a25-8168-4dc5-807f-fee261244fce"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/dist-packages/neutron/agent/linux/async_process.py:234
  [...]
  2017-11-23 00:56:44.314 2800 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3d7d1ee9-7da8-4963-b7d1-aea24cbd6343 - - - - -] Port qvo240480bb-8c was deleted concurrently, skipping it _bind_devices /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:855

  _bind_devices looks like this
  (full version at https://github.com/openstack/neutron/blob/10.0.3/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L834):

      def _bind_devices(self, need_binding_ports):
          # [...]
          port_names = [p['vif_port'].port_name for p in need_binding_ports]
          port_info = self.int_br.get_ports_attributes(
              "Port", columns=["name", "tag"], ports=port_names, if_exists=True)
          tags_by_name = {x['name']: x['tag'] for x in port_info}
          for port_detail in need_binding_ports:
              # [...]
              port = port_detail['vif_port']
              # Do not bind a port if it's already bound
              cur_tag = tags_by_name.get(port.port_name)
              if cur_tag is None:
                  LOG.debug("Port %s was deleted concurrently, skipping it",
                            port.port_name)
                  continue

  which seems to suggest that binding was only even attempted five
  minutes later, *after* nova had given up and deleted the port, which
  is rather strange if true.  Another theory that has since occurred
  to me is that neutron-openvswitch-agent somehow got "stuck" for long
  enough that nova-compute timed out.

  This is also the only line that _bind_devices logs for such ports
  that fail to launch in this manner on our cloud.  There is no
  logging corresponding to "Setting status for %s to UP",
  "Configuration for devices %s failed!", or to "Configuration for
  devices up %(up)s and devices down %(down)s completed." (see
  further down in the code at the github link above).

  I cherry-picked commit c6044fb86174caa5ec142a60a1f442764b935e5a
  from ocata (it looks like it will be in neutron 10.0.5; "git
  show": http://paste.ubuntu.com/26031447/) as it seemed like it
  could be relevant, but it did not improve the situation markedly,
  although we have logged a few cases where the code was invoked.

  We have another similar cloud hosting similar loads, but running
  mitaka, that does not evince this problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1734227/+subscriptions


