[Bug 1762368] [NEW] reinstalling a compute node and then upgrading from pike to queens fails

Mon Apr 9 10:19:10 UTC 2018

Public bug reported:

Hi,

I recently had a working xenial/pike cloud, using neutron-ovs, with
several compute nodes, including a ppc64 compute node named bagon. I
needed to reinstall it, so I did the following:

1. nova service-delete <id of the compute service on bagon>
2. neutron agent-delete <uuid of the openvswitch agent on bagon>
3. Re-commission the node and deploy the nova-compute application on it

Some time after that, I upgraded the cloud to queens. This apparently
caused the node to stop working. It was logging the following error
(nova-compute.log on bagon):

2018-04-09 06:25:26.099 128068 ERROR nova.scheduler.client.report
[req-f1eebe14-fcfb-4878-b557-50105790d3b5
6bd667e324ea463abaacbc1f9c3bbed3 95cafd7ede504ef6b7b67ead691d3883 -
default default] [req-29de76b9-50c2-4bff-85a9-363d665c250f] Failed to
create resource provider record in placement API for UUID
2d236848-df06-47f1-92a4-a1afefe62931. Got 409: {"errors": [{"status":
409, "request_id": "req-29de76b9-50c2-4bff-85a9-363d665c250f",
"detail": "There was a conflict when trying to complete your
request.\n\n Conflicting resource provider name: bagon.fqdn already
exists.  ", "title": "Conflict"}]}.
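
The conflicting provider name is buried in the "detail" field of the
409 body. As a minimal sketch (the helper name conflicting_names is
hypothetical, and the regular expression is written only against the
message text quoted above, not against any documented placement
format), it could be pulled out like this:

```python
import json
import re

# Error body as logged above, with the mail line wrapping removed.
body = (
    '{"errors": [{"status": 409, '
    '"request_id": "req-29de76b9-50c2-4bff-85a9-363d665c250f", '
    '"detail": "There was a conflict when trying to complete your request.'
    '\\n\\n Conflicting resource provider name: bagon.fqdn already exists.  ", '
    '"title": "Conflict"}]}'
)

# Pattern assumed from the single message in this report.
NAME_RE = re.compile(r"Conflicting resource provider name: (\S+) already exists")


def conflicting_names(raw):
    """Return the provider names reported as conflicting in an error body."""
    names = []
    for err in json.loads(raw).get("errors", []):
        match = NAME_RE.search(err.get("detail", ""))
        if match:
            names.append(match.group(1))
    return names


print(conflicting_names(body))
```

The extracted name is the value to look up in
nova_api.resource_providers.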

Full stack trace: https://pastebin.canonical.com/p/ynhpgsB8bp/ (sorry,
Canonical-only link)

I tracked down the problem and found it was due to the following
mismatch:

mysql> select uuid,host,deleted from compute_nodes where host='bagon';
+--------------------------------------+-------+---------+
| uuid                                 | host  | deleted |
+--------------------------------------+-------+---------+
| 2d236848-df06-47f1-92a4-a1afefe62931 | bagon |       0 |
| 92232041-9767-466b-a82f-20ecef0af6fa | bagon |       9 |
+--------------------------------------+-------+---------+
2 rows in set (0.00 sec)

mysql> use nova_api;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select uuid,name from resource_providers where name like 'bagon%';
+--------------------------------------+--------------------------+
| uuid                                 | name                     |
+--------------------------------------+--------------------------+
| 92232041-9767-466b-a82f-20ecef0af6fa | bagon.fqdn               |
+--------------------------------------+--------------------------+
1 row in set (0.00 sec)

The nova.compute_nodes table has 2 records for bagon, as expected: one
is the old, deleted record and the other is the current, live record.

The problem, as you can see above, is that the
nova_api.resource_providers table still had the old UUID for bagon. I'm
not exactly sure at what point nova-compute on bagon started failing,
but I'm fairly confident it was OK after the reinstall, so I suspect
something happened during the migration from pike to queens.
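
The same cross-check can be expressed as a single query. The sketch
below is illustrative only: it uses an in-memory SQLite database with
just the columns shown in the output above, keeps both tables in one
database (in the real deployment they live in nova and nova_api), and
matches provider names to hosts by prefix as the 'bagon%' query does.
The function name stale_providers is hypothetical.

```python
import sqlite3

# In-memory stand-ins for the two tables, seeded with the rows from the report.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE compute_nodes (uuid TEXT, host TEXT, deleted INTEGER);
CREATE TABLE resource_providers (uuid TEXT, name TEXT);
INSERT INTO compute_nodes VALUES
  ('2d236848-df06-47f1-92a4-a1afefe62931', 'bagon', 0),
  ('92232041-9767-466b-a82f-20ecef0af6fa', 'bagon', 9);
INSERT INTO resource_providers VALUES
  ('92232041-9767-466b-a82f-20ecef0af6fa', 'bagon.fqdn');
""")


def stale_providers(conn, host):
    """Return (provider_uuid, provider_name, live_uuid) for providers named
    after `host` whose UUID differs from the live compute node's UUID."""
    return conn.execute(
        """
        SELECT rp.uuid, rp.name, cn.uuid
        FROM resource_providers AS rp
        JOIN compute_nodes AS cn
          ON cn.deleted = 0
         AND (rp.name = cn.host OR rp.name LIKE cn.host || '.%')
        WHERE cn.host = ? AND rp.uuid <> cn.uuid
        """,
        (host,),
    ).fetchall()


print(stale_providers(db, "bagon"))
```

A non-empty result means the provider row is still tied to a UUID other
than the one the live compute node reports.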

I manually updated the UUID in the resource_providers table, and bagon
started working fine.
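
The exact statement used for the manual update is not recorded here. For
illustration only, such an update could look like the sketch below,
again against an in-memory SQLite stand-in rather than the real nova_api
database. The UNIQUE constraints and the helper name repoint_provider
are assumptions, and this is not presented as a recommended or
supported fix.

```python
import sqlite3

OLD_UUID = "92232041-9767-466b-a82f-20ecef0af6fa"  # deleted compute_nodes row
NEW_UUID = "2d236848-df06-47f1-92a4-a1afefe62931"  # live compute_nodes row

# Stand-in table holding only the stale row from the report.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE resource_providers (uuid TEXT UNIQUE, name TEXT UNIQUE)")
db.execute("INSERT INTO resource_providers VALUES (?, ?)", (OLD_UUID, "bagon.fqdn"))
db.commit()


def repoint_provider(conn, name, old_uuid, new_uuid):
    """Replace old_uuid with new_uuid on the named provider row.

    Returns the number of rows changed (0 if nothing matched).
    """
    with conn:  # commit on success, roll back if the statement raises
        cur = conn.execute(
            "UPDATE resource_providers SET uuid = ? WHERE name = ? AND uuid = ?",
            (new_uuid, name, old_uuid),
        )
        return cur.rowcount


changed = repoint_provider(db, "bagon.fqdn", OLD_UUID, NEW_UUID)
print(changed, db.execute("SELECT uuid, name FROM resource_providers").fetchall())
```

Matching on both name and old UUID keeps the statement from touching
any row other than the stale one, and running it a second time changes
nothing.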

I can't try to reproduce this because I can't downgrade the cluster to
run the pike=>queens upgrade a second time, but hopefully you can.

Thanks!

** Affects: cloud-archive
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1762368
