[Bug 1751923] Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

sean mooney 1751923 at bugs.launchpad.net
Tue Oct 6 18:58:36 UTC 2020


for what its worth this has been partially backported downstream in
redhat osp

we backported only the self healying and not the online data migration
which had a bug in it.

so https://review.opendev.org/#/c/591607/ can be safely backport ported
but https://review.opendev.org/#/c/614167/20 has a bug and should not be
backported.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1751923

Title:
  _heal_instance_info_cache periodic task bases on port list from nova
  db, not from neutron server

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  In Progress
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in nova package in Ubuntu:
  Confirmed
Status in nova source package in Bionic:
  Confirmed
Status in nova source package in Disco:
  Fix Released

Bug description:
  Description
  ===========

  During periodic task _heal_instance_info_cache the
  instance_info_caches are not updated using instance port_ids taken
  from neutron, but from nova db.

  Sometimes, perhaps because of some race-condition, its possible to
  lose some ports from instance_info_caches. Periodic task
  _heal_instance_info_cache should clean this up (add missing records),
  but in fact it's not working this way.

  How it looks now?
  =================

  _heal_instance_info_cache during crontask:

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525

  is using network_api to get instance_nw_info (instance_info_caches):

              try:
                  # Call to network API to get instance info.. this will
                  # force an update to the instance's info_cache
                  self.network_api.get_instance_nw_info(context, instance)

  self.network_api.get_instance_nw_info() is listed below:

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377

  and it uses _build_network_info_model() without networks and port_ids
  parameters (because we're not adding any new interface to instance):

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356

  Next: _gather_port_ids_and_networks() generates the list of instance
  networks and port_ids:

        networks, port_ids = self._gather_port_ids_and_networks(
                  context, instance, networks, port_ids, client)

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393

  As we see that _gather_port_ids_and_networks() takes the port list
  from DB:

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176

  And thats it. When we lose a port its not possible to add it again with this periodic task.
  The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.

  When the interface is missing and there is no port configured on
  compute host (for example after compute reboot) - interface is not
  added to instance and from neutron point of view port state is DOWN.

  When the interface is missing in cache and we reboot hard the instance
  - its not added as tapinterface in xml file = we don't have the
  network on host.

  Steps to reproduce
  ==================
  1. Spawn devstack
  2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
  3. Update the DB row, drop one interface from interfaces_list
  4. Hard-Reboot the instance
  5. See that nova list shows instance without one address, but nova interface-list shows all addresses
  6. See that one port is missing in instance xml files
  7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.

  Reproduced Example
  ==================
  1. Spawn VM with 1 private network port
  nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private  test-2
  2. Attach ports to have 2 private and 2 public interfaces
  nova list:
  | a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | -          | Running     | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |

  So we see 4 ports:
  stack at mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | Port State | Port ID                              | Net ID                               | IP addresses                                  | MAC Addr          |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | ACTIVE     | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
  | ACTIVE     | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c                       | fa:16:3e:6d:dc:85 |
  | ACTIVE     | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e                       | fa:16:3e:cf:0c:e0 |
  | ACTIVE     | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  stack at mjozefcz-devstack-ptg:~$

  We can also see 4 tap interfaces in xml file:

  stack at mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
        <target dev='tap6c230305-43'/>
        <target dev='tapb89d6863-fb'/>
        <target dev='tapa74c9ee8-c4'/>
        <target dev='tap71e6c6ad-80'/>
  stack at mjozefcz-devstack-ptg:~$

  3. Now lets 'corrupt' the instance_info_caches for this specific VM.
  We also noticed some race-condition that cause the same problem, but
  we're unable to reproduce it in devel environment.

  Original one:

  ---
  mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
  *************************** 1. row ***************************
     created_at: 2018-02-26 21:25:31
     updated_at: 2018-02-26 21:29:17
     deleted_at: NULL
             id: 2
   network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
  instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
        deleted: 0
  1 row in set (0.00 sec)
  ----

  Modified one (I removed first port from list):
  tap6c230305-43

  ----
  mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
  *************************** 1. row ***************************
     created_at: 2018-02-26 21:25:31
     updated_at: 2018-02-26 21:29:17
     deleted_at: NULL
             id: 2
   network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
  instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
        deleted: 0
  ----

  4. Now lets take a look on `nova list`:

  stack at mjozefcz-devstack-ptg:~$ nova list | grep test-2
  | a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | -          | Running     | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
  stack at mjozefcz-devstack-ptg:~$

  So as you see we missed one interface (private).

  Nova interface-list shows it (because it calls neutron instead nova
  itself):

  stack at mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | Port State | Port ID                              | Net ID                               | IP addresses                                  | MAC Addr          |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | ACTIVE     | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
  | ACTIVE     | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c                       | fa:16:3e:6d:dc:85 |
  | ACTIVE     | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e                       | fa:16:3e:cf:0c:e0 |
  | ACTIVE     | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  stack at mjozefcz-devstack-ptg:~$

  5. During this time check the logs - yes, the
  _heal_instance_info_cache has been running for a while but without
  success - stil missing port in instance_info_caches table:

  Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
  Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
  Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}

  5. Ok, so lets pretend that customer restart the VM.
  stack at mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
  Request to reboot server <Server: test-2> has been accepted.

  6. And now check connected interfaces - WOOPS there is no
  `tap6c230305-43` on the list ;(

  stack at mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002  | grep -i tap
        <target dev='tapb89d6863-fb'/>
        <target dev='tapa74c9ee8-c4'/>
        <target dev='tap71e6c6ad-80'/>

  Environment
  ===========
  Nova master branch, devstack

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1751923/+subscriptions



More information about the Ubuntu-openstack-bugs mailing list