[Bug 2091947] Re: [SRU] Watcher crashes on creation of multiple audits and gets stuck in PENDING
Bryan Fraschetti
2091947 at bugs.launchpad.net
Thu Jan 23 22:06:35 UTC 2025
** Description changed:
+ [ Impact ]
+
+ * The watcher releases targeted by this SRU have a bug where only one
+ audit of type CONTINUOUS can be created. Any subsequently created
+ audits get stuck in the PENDING state. The root cause is the conversion
+ of an improperly typed date, which causes watcher to crash. The function
+ converting the date format, utc_timestamp_to_datetime, expects the
+ timestamp to be of type float, but Watcher has been passing the date as
+ a decimal.Decimal object. The patch at [1] casts the value to float
+ before converting it to a datetime object.
+
+ * The commit landed upstream in 2024.2.
+
+ [ Test Plan ]
+
+ * Deploy OpenStack Yoga on Jammy with the watcher and gnocchi services
+
+ * Create two watcher audits of CONTINUOUS type and monitor their status
+ openstack optimize audit create --name test_audit_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
+ openstack optimize audit create --name test_audit_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
+ openstack optimize audit list
+
+ * Without the patch, the second audit gets stuck in the PENDING state
+ and systemctl status watcher-decision-engine.service shows that a
+ crash occurred. With the patch, both audits enter the ONGOING
+ state.
+
+ [ What can go wrong ]
+
+ * This commit overrides apscheduler's implementation of
+ get_next_run_time, since apscheduler's implementation obtains the
+ decimal.Decimal object that crashes the engine. This should expand
+ compatibility to include SQLAlchemy 2.0 but may have other
+ effects. It should not, since the function being overridden is what
+ precipitates the issue, but it may affect legacy software (e.g. older
+ SQLAlchemy).
+
+ [1]
+ https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
+
+
+ --------------------------------------
+ Original Description:
+
A customer is facing an issue where the watcher-decision-engine service
crashes when creating an audit plan with the Audit type set to
CONTINUOUS. Below are the steps to reproduce the issue:
Environment Details:
1. Deploy OpenStack Yoga on Jammy with Watcher and Gnocchi as watcher's storage backend
2. Create an audit
openstack optimize audit create --name workload_stabilization_test_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
3. Check the audit state
openstack optimize audit list
Observe it says "CONTINUOUS ONGOING"
4. Create a second audit
openstack optimize audit create --name workload_stabilization_test_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
5. Check the audit state
openstack optimize audit list
Observe the second audit is stuck in "CONTINUOUS PENDING"
6. Check watcher's status and observe that it crashed with the following traceback
systemctl status watcher-decision-engine.service
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self.run()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3.10/threading.py", line 953, in run
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self._target(*self._args, **self._kwargs)
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/blocking.py", line 32, in _main_loop
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: wait_seconds = self._process_jobs()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/base.py", line 1006, in _process_jobs
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: jobstore_next_run_time = jobstore.get_next_run_time()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/jobstores/sqlalchemy.py", line 84, in get_next_run_time
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: return utc_timestamp_to_datetime(float(next_run_time))
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: TypeError: float() argument must be a string or a real number, not 'NoneType'
This was fixed upstream in 2024.2 at
https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
which properly addresses the type conversion and
https://opendev.org/openstack/watcher/commit/fbb290b2238e9e72054892e9ae6108a8907f47d7
which adjusts the unit tests to support croniter 5.0.0+, which is the
default installed by tox on Noble and Oracular since they are shipped
with Python3.12.
** Description changed:
[ Impact ]
- * The watcher releases targeted by this SRU have a bug where only one
+ * The watcher releases targeted by this SRU have a bug where only one
audit of type CONTINUOUS can be created. Any subsequently created
audits get stuck in the PENDING state. The root cause is the conversion
of an improperly typed date, which causes watcher to crash. The function
converting the date format, utc_timestamp_to_datetime, expects the
timestamp to be of type float, but Watcher has been passing the date as
a decimal.Decimal object. The patch at [1] casts the value to float
before converting it to a datetime object.
- * The commit landed upstream in 2024.2.
+ * The commit landed upstream in 2024.2.
[ Test Plan ]
- * Deploy OpenStack Yoga on Jammy with the watcher and gnocchi services
+ * Deploy OpenStack Yoga on Jammy with the watcher and gnocchi services
- * Create two watcher audits of CONTINUOUS type and monitor their status
- openstack optimize audit create --name test_audit_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
- openstack optimize audit create --name test_audit_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
- openstack optimize audit list
+ * Create two watcher audits of CONTINUOUS type and monitor their status
+ openstack optimize audit create --name test_audit_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
+ openstack optimize audit create --name test_audit_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
+ openstack optimize audit list
- * Without the patch, the second audit gets stuck in the PENDING state
+ * Without the patch, the second audit gets stuck in the PENDING state
and systemctl status watcher-decision-engine.service shows that a
crash occurred. With the patch, both audits enter the ONGOING
state.
[ What can go wrong ]
- * This commit overrides apscheduler's implementation of
+ * This commit overrides apscheduler's implementation of
get_next_run_time, since apscheduler's implementation obtains the
decimal.Decimal object that crashes the engine. This should expand
compatibility to include SQLAlchemy 2.0 but may have other
effects. It should not, since the function being overridden is what
precipitates the issue, but it may affect legacy software (e.g. older
SQLAlchemy).
[1]
https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
-
--------------------------------------
Original Description:
A customer is facing an issue where the watcher-decision-engine service
crashes when creating an audit plan with the Audit type set to
CONTINUOUS. Below are the steps to reproduce the issue:
Environment Details:
1. Deploy OpenStack Yoga on Jammy with Watcher and Gnocchi as watcher's storage backend
2. Create an audit
openstack optimize audit create --name workload_stabilization_test_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
3. Check the audit state
openstack optimize audit list
Observe it says "CONTINUOUS ONGOING"
4. Create a second audit
openstack optimize audit create --name workload_stabilization_test_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
5. Check the audit state
openstack optimize audit list
Observe the second audit is stuck in "CONTINUOUS PENDING"
6. Check watcher's status and observe that it crashed with the following traceback
systemctl status watcher-decision-engine.service
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self.run()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3.10/threading.py", line 953, in run
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self._target(*self._args, **self._kwargs)
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/blocking.py", line 32, in _main_loop
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: wait_seconds = self._process_jobs()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/base.py", line 1006, in _process_jobs
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: jobstore_next_run_time = jobstore.get_next_run_time()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/jobstores/sqlalchemy.py", line 84, in get_next_run_time
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: return utc_timestamp_to_datetime(float(next_run_time))
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: TypeError: float() argument must be a string or a real number, not 'NoneType'
This was fixed upstream in 2024.2 at
https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
- which properly addresses the type conversion and
- https://opendev.org/openstack/watcher/commit/fbb290b2238e9e72054892e9ae6108a8907f47d7
- which adjusts the unit tests to support croniter 5.0.0+, which is the
- default installed by tox on Noble and Oracular since they are shipped
- with Python3.12.
+ which properly addresses the type conversion
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/2091947
Title:
[SRU] Watcher crashes on creation of multiple audits and gets stuck in
PENDING
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive antelope series:
New
Status in Ubuntu Cloud Archive bobcat series:
New
Status in Ubuntu Cloud Archive caracal series:
New
Status in Ubuntu Cloud Archive dalmatian series:
Fix Released
Status in Ubuntu Cloud Archive epoxy series:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
New
Status in Ubuntu Cloud Archive zed series:
New
Status in watcher package in Ubuntu:
Fix Released
Status in watcher source package in Focal:
Confirmed
Status in watcher source package in Jammy:
Confirmed
Status in watcher source package in Noble:
Confirmed
Status in watcher source package in Oracular:
Fix Released
Status in watcher source package in Plucky:
Fix Released
Bug description:
[ Impact ]
* The watcher releases targeted by this SRU have a bug where only one
audit of type CONTINUOUS can be created. Any subsequently created
audits get stuck in the PENDING state. The root cause is the conversion
of an improperly typed date, which causes watcher to crash. The function
converting the date format, utc_timestamp_to_datetime, expects the
timestamp to be of type float, but Watcher has been passing the date as
a decimal.Decimal object. The patch at [1] casts the value to float
before converting it to a datetime object.
* The commit landed upstream in 2024.2.
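
The type mismatch described above can be illustrated with a minimal
sketch. The helper name decimal_timestamp_to_datetime is hypothetical and
uses only the Python standard library; it is not the apscheduler function
or the upstream patch, only an example of casting a decimal.Decimal
timestamp to float before building a datetime.

```python
from datetime import datetime, timezone
from decimal import Decimal


def decimal_timestamp_to_datetime(raw):
    """Hypothetical helper: cast a Decimal (or float) epoch timestamp
    to float first, then build a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(float(raw), tz=timezone.utc)


# A Decimal value such as a NUMERIC database column might return
# (assumed example value, not taken from the bug report).
stored = Decimal("86400.5")
print(decimal_timestamp_to_datetime(stored))
```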
[ Test Plan ]
* Deploy OpenStack Yoga on Jammy with the watcher and gnocchi services
* Create two watcher audits of CONTINUOUS type and monitor their status
openstack optimize audit create --name test_audit_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
openstack optimize audit create --name test_audit_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
openstack optimize audit list
* Without the patch, the second audit gets stuck in the PENDING state
and systemctl status watcher-decision-engine.service shows that a
crash occurred. With the patch, both audits enter the ONGOING
state.
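
The pass/fail check in the test plan could be scripted roughly as
follows. This is a sketch under assumptions: it expects the audit list
to be available as JSON rows (for example from running the list command
with "-f json") and assumes each row carries "Name" and "State" keys.
Neither the key names nor the sample rows come from this report.

```python
import json


def audits_not_ongoing(audit_rows):
    """Return the names of audits whose state is anything but ONGOING.

    audit_rows is a list of dicts; the "Name" and "State" keys are
    assumed column names, not confirmed by the bug report.
    """
    return [row["Name"] for row in audit_rows if row.get("State") != "ONGOING"]


# Assumed sample data mimicking the unpatched behaviour described above.
sample = json.loads(
    '[{"Name": "test_audit_1", "State": "ONGOING"},'
    ' {"Name": "test_audit_2", "State": "PENDING"}]'
)
print(audits_not_ongoing(sample))
```

An empty result would correspond to the patched behaviour, where both
audits reach ONGOING.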
[ What can go wrong ]
* This commit overrides apscheduler's implementation of
get_next_run_time, since apscheduler's implementation obtains the
decimal.Decimal object that crashes the engine. This should expand
compatibility to include SQLAlchemy 2.0 but may have other
effects. It should not, since the function being overridden is what
precipitates the issue, but it may affect legacy software (e.g. older
SQLAlchemy).
[1]
https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
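
The override pattern mentioned under "What can go wrong" might look
roughly like the sketch below. StandInJobStore and PatchedJobStore are
hypothetical stand-ins written for illustration; they are not the
apscheduler classes and not the code of the commit at [1]. The base
method mirrors the float() call visible in the traceback further down,
and the None guard in the subclass is an assumption based on that
traceback.

```python
from datetime import datetime, timezone
from decimal import Decimal


class StandInJobStore:
    """Stand-in for a scheduler job store (NOT the real apscheduler class)."""

    def __init__(self, stored_next_run_time):
        # Value as it might come back from the database: Decimal, float or None.
        self._stored = stored_next_run_time

    def get_next_run_time(self):
        # Mirrors the traceback line: float() is applied unconditionally,
        # so a None value raises TypeError.
        return datetime.fromtimestamp(float(self._stored), tz=timezone.utc)


class PatchedJobStore(StandInJobStore):
    """Subclass overriding get_next_run_time (illustrative only)."""

    def get_next_run_time(self):
        raw = self._stored
        if raw is None:
            # Assumed behaviour: no scheduled run means no next run time.
            return None
        # Cast Decimal to float before building the datetime.
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)


print(PatchedJobStore(Decimal("60")).get_next_run_time())
print(PatchedJobStore(None).get_next_run_time())
```

Because only one method is overridden, the risk is limited to callers
that relied on the original method's behaviour, which matches the concern
raised above about older SQLAlchemy versions.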
--------------------------------------
Original Description:
A customer is facing an issue where the watcher-decision-engine
service crashes when creating an audit plan with the Audit type set to
CONTINUOUS. Below are the steps to reproduce the issue:
Environment Details:
1. Deploy OpenStack Yoga on Jammy with Watcher and Gnocchi as watcher's storage backend
2. Create an audit
openstack optimize audit create --name workload_stabilization_test_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
3. Check the audit state
openstack optimize audit list
Observe it says "CONTINUOUS ONGOING"
4. Create a second audit
openstack optimize audit create --name workload_stabilization_test_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
5. Check the audit state
openstack optimize audit list
Observe the second audit is stuck in "CONTINUOUS PENDING"
6. Check watcher's status and observe that it crashed with the following traceback
systemctl status watcher-decision-engine.service
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self.run()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3.10/threading.py", line 953, in run
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self._target(*self._args, **self._kwargs)
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/blocking.py", line 32, in _main_loop
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: wait_seconds = self._process_jobs()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/base.py", line 1006, in _process_jobs
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: jobstore_next_run_time = jobstore.get_next_run_time()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/jobstores/sqlalchemy.py", line 84, in get_next_run_time
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: return utc_timestamp_to_datetime(float(next_run_time))
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: TypeError: float() argument must be a string or a real number, not 'NoneType'
This was fixed upstream in 2024.2 at
https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
which properly addresses the type conversion
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2091947/+subscriptions