[Bug 1969000] Re: [SRU] bail from handle_command() if _generate_command_map() fails
nikhil kshirsagar
1969000 at bugs.launchpad.net
Tue Aug 8 06:17:26 UTC 2023
** Tags added: se-sponsors
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1969000
Title:
[SRU] bail from handle_command() if _generate_command_map() fails
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive ussuri series:
New
Status in ceph package in Ubuntu:
New
Status in ceph source package in Focal:
New
Status in ceph source package in Impish:
New
Status in ceph source package in Jammy:
New
Status in ceph source package in Kinetic:
New
Bug description:
[Impact]
If malformed JSON data is passed to rados, either through a manual curl command or through a script such as the Python example shown in the test plan, it can crash the mon.
[Test Plan]
Set up a Ceph Octopus cluster. Manually running curl with a malformed request like this one results in the crash:
curl -k -H "Authorization: Basic $TOKEN" \
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST \
  -d '{"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}'
Checking the request status shows that the request is still in the queue:
curl -k -X GET "$endpoint/request"
[
    {
        "failed": [],
        "finished": [],
        "has_failed": false,
        "id": "140576245092648",
        "is_finished": false,
        "is_waiting": false,
        "running": [
            {
                "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
                "outb": "",
                "outs": ""
            }
        ],
        "state": "pending",
        "waiting": []
    }
]
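A status check like this can also be scripted. The following is a minimal sketch that parses a response of the shape shown above and lists the commands that are still running. The function name pending_commands is hypothetical, and the sample string is copied from the output above rather than fetched from a live endpoint.

```python
import json


def pending_commands(status_json):
    """Return command strings of requests that have neither finished nor failed."""
    pending = []
    for request in json.loads(status_json):
        if request.get("is_finished") or request.get("has_failed"):
            continue
        for entry in request.get("running", []):
            pending.append(entry["command"])
    return pending


# Sample copied from the status output above (not a live query).
sample = '''[{"failed": [], "finished": [], "has_failed": false,
  "id": "140576245092648", "is_finished": false, "is_waiting": false,
  "running": [{"command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'", "outb": "", "outs": ""}],
  "state": "pending", "waiting": []}]'''

print(pending_commands(sample))
```

A non-empty result long after the request was submitted is consistent with the stuck "pending" state described here.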
This also reproduces without the restful API.
Use this Python script to reproduce the issue. Run it on the mon node:
root@juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
#!/usr/bin/env python3
import json
import rados
c = rados.Rados(conffile='/etc/ceph/ceph.conf')
c.connect()
cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
print(c.mon_command(cmd, b''))
root@juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
  cluster:
    id:     6123c916-a12a-11ec-bc02-fa163e9f86e0
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim
            1 monitors have not enabled msgr2
            Reduced data availability: 69 pgs inactive
            1921 daemons have recently crashed
  services:
    mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
    mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
    osd: 3 osds: 3 up (since 22h), 3 in
  data:
    pools:   4 pools, 69 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             69 unknown
root@juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
^C
(note the script hangs)
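For comparison, here is a minimal sketch of the two payload shapes. It assumes the malformed part of the reproducer is that caps arrives as one quoted string, and that a well-formed request passes caps as a list of alternating service names and capability strings; both are assumptions to verify against your Ceph release. The helper name build_auth_add is hypothetical, and the code only builds and prints JSON without contacting a cluster.

```python
import json


def build_auth_add(entity, caps):
    """Serialize an 'auth add' mon command; caps may be a list or a string."""
    return json.dumps({"prefix": "auth add", "entity": entity, "caps": caps})


# Shape used in the reproducer: caps is a single string
# (the shell quoting from the curl example is simplified here).
malformed = build_auth_add(
    "client.testuser02", "mon 'allow r' osd 'allow rw pool=testpool01'")

# Assumed well-formed shape: a list of service name / capability pairs.
well_formed = build_auth_add(
    "client.testuser02", ["mon", "allow r", "osd", "allow rw pool=testpool01"])

print(malformed)
print(well_formed)
```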
The mon logs (https://pastebin.com/Cuu9jkmu) show the crash. systemd
then appears to restart ceph, so ceph -s hangs for a while, after which
we see restart messages like these:
--- end dump of recent events ---
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
The fix that catches the exception is already part of the Octopus 15.2.17 point release (PR https://github.com/ceph/ceph/pull/45891).
We also need a cleanup fix that has since been merged upstream: https://github.com/ceph/ceph/pull/45547
The cleanup fix bails out of
void Monitor::handle_command(MonOpRequestRef op) as soon as the
exception is thrown, so the function does not continue executing in
this error situation.
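The control flow can be illustrated with a small Python sketch. This is not the Ceph C++ code: the exception class, both functions, the validation rule, and the -22 (EINVAL) return code are hypothetical stand-ins chosen only to show the "catch, reply with an error, and return early" pattern.

```python
class BadCommandArgument(Exception):
    """Hypothetical stand-in for the exception raised on malformed arguments."""


def generate_command_map(cmd):
    # Stand-in validation: reject caps that are not a list of strings.
    caps = cmd.get("caps", [])
    if not isinstance(caps, list):
        raise BadCommandArgument("caps must be a list of strings")
    return dict(cmd)


def handle_command(cmd):
    try:
        command_map = generate_command_map(cmd)
    except BadCommandArgument as exc:
        # Bail out here: report the error and skip the rest of the handler.
        return (-22, f"bad or missing field: {exc}")
    # Normal path, reached only when the command map was built successfully.
    return (0, f"dispatching {command_map['prefix']}")


print(handle_command({"prefix": "auth add", "caps": "mon 'allow r'"}))
print(handle_command({"prefix": "auth add", "caps": ["mon", "allow r"]}))
```

Returning inside the except branch is the point of the sketch: nothing after the try block runs with a command map that was never built.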
[Where problems could occur]
The only potential problem with this cleanup fix would be if
some additional code in the void Monitor::handle_command(MonOpRequestRef op) function needed to run before the function bails out. I have looked for such conditions and have not found any.
[Other Info]
Upstream tracker - https://tracker.ceph.com/issues/57859
Fixed in ceph main through https://github.com/ceph/ceph/pull/48044
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1969000/+subscriptions