[Bug 1969000] Re: [SRU] bail from handle_command() if _generate_command_map() fails
nikhil kshirsagar
1969000 at bugs.launchpad.net
Mon Mar 6 11:32:09 UTC 2023
https://github.com/ceph/ceph/pull/48845 and
https://github.com/ceph/ceph/pull/48846 are merged upstream. Once they
are available in the Ubuntu ceph pacific and quincy point releases, this
Octopus SRU can be submitted.
(I will raise in Thursday's ceph office hours whether I should just SRU
this one-liner fix to Ubuntu P/Q if the point releases are some time
away.)
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1969000
Title:
[SRU] bail from handle_command() if _generate_command_map() fails
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive ussuri series:
New
Status in ceph package in Ubuntu:
New
Status in ceph source package in Focal:
New
Status in ceph source package in Impish:
New
Status in ceph source package in Jammy:
New
Status in ceph source package in Kinetic:
New
Bug description:
[Impact]
If improper JSON data is passed to rados using a manual curl command, or invalid JSON is sent through a script like the Python example shown below, it can end up crashing the mon.
[Test Plan]
Set up a ceph Octopus cluster. A manual run of curl with a malformed request like the following results in the crash:
curl -k -H "Authorization: Basic $TOKEN" \
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST \
  -d '{"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}'
The request status shows it is still in the queue if you check with:
curl -k -X GET "$endpoint/request"
[
  {
    "failed": [],
    "finished": [],
    "has_failed": false,
    "id": "140576245092648",
    "is_finished": false,
    "is_waiting": false,
    "running": [
      {
        "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
        "outb": "",
        "outs": ""
      }
    ],
    "state": "pending",
    "waiting": []
  }
]
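The same check can be scripted. A minimal sketch using python-requests, assuming the same endpoint and Basic-auth token as the curl commands above (ENDPOINT and TOKEN are placeholders):

#!/usr/bin/env python3
# Sketch only: poll the restful API request queue, equivalent to the
# curl check above. ENDPOINT and TOKEN are placeholders.
import requests

ENDPOINT = "https://juju-3b3d82-10-lxd-0:8003"
TOKEN = "..."  # same Basic-auth token as in the curl examples

resp = requests.get(ENDPOINT + "/request",
                    headers={"Authorization": "Basic " + TOKEN},
                    verify=False)  # matches curl -k
# Because the mon has crashed, the request never leaves "pending".
for req in resp.json():
    print(req["id"], req["state"])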
This reproduces without the restful API too. Use this Python script to
reproduce the issue; run it on the mon node:
root@juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
#!/usr/bin/env python3
import json
import rados
# Connect to the cluster via the local ceph.conf.
c = rados.Rados(conffile='/etc/ceph/ceph.conf')
c.connect()
# The caps value is deliberately malformed: it still carries the
# shell-quoting debris ('\'') from the curl command above, which is
# what _generate_command_map() chokes on in the mon.
cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
print(c.mon_command(cmd, b''))
root@juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
  cluster:
    id:     6123c916-a12a-11ec-bc02-fa163e9f86e0
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim
            1 monitors have not enabled msgr2
            Reduced data availability: 69 pgs inactive
            1921 daemons have recently crashed

  services:
    mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
    mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
    osd: 3 osds: 3 up (since 22h), 3 in

  data:
    pools:   4 pools, 69 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             69 unknown
root@juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
^C
(note the script hangs)
The mon logs (https://pastebin.com/Cuu9jkmu) show the crash; systemd
then appears to restart the ceph-mon, so ceph -s hangs for a while, and
then we see restart messages like:
--- end dump of recent events ---
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
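For contrast, the same request with a well-formed caps argument does not trigger the crash. A minimal sketch (client.testuser03 is a hypothetical, unused entity; caps is passed as the list of type/value pairs the auth commands expect, rather than one quote-mangled string):

#!/usr/bin/env python3
# Contrast case (sketch): the same "auth add" with well-formed caps.
import json
import rados

c = rados.Rados(conffile='/etc/ceph/ceph.conf')
c.connect()
cmd = json.dumps({
    "prefix": "auth add",
    "entity": "client.testuser03",  # hypothetical, unused entity
    # caps as alternating type/value strings, no stray shell quoting
    "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"],
})
ret, outbuf, outs = c.mon_command(cmd, b'')
print(ret, outs)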
While the fix to catch the exception is already part of the Octopus 15.2.17 point release (PR https://github.com/ceph/ceph/pull/45891),
we also need a cleanup fix that has now been merged upstream: https://github.com/ceph/ceph/pull/45547
The cleanup fix bails out of void Monitor::handle_command(MonOpRequestRef op)
as soon as the exception is thrown, instead of continuing through the
rest of the function in this error situation.
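In effect, the change adds an early return. A minimal, self-contained Python sketch of the control flow (the real fix is C++ in Monitor::handle_command; the function names below are illustrative stand-ins, not Ceph APIs):

#!/usr/bin/env python3
import errno
import json

def generate_command_map(cmd_json):
    # Stand-in for _generate_command_map(): raises on malformed input,
    # as the C++ version does when the supplied JSON is invalid.
    cmd = json.loads(cmd_json)
    for key, val in cmd.items():
        if not isinstance(val, (str, list)):
            raise ValueError("bad arg for %r" % key)
    return cmd

def handle_command(cmd_json):
    try:
        cmd_map = generate_command_map(cmd_json)
    except Exception as e:
        # The fix: reply with EINVAL and bail out here, instead of
        # falling through with a half-initialized command map.
        return -errno.EINVAL, "error: %s" % e
    # ...the rest of the handler only runs with a valid map...
    return 0, "handled: %s" % cmd_map.get("prefix")

print(handle_command('{"prefix": "auth add", "entity": 1}'))  # bails out
print(handle_command('{"prefix": "status"}'))                 # proceeds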
[Where problems could occur]
The only potential problem with this cleanup fix is if some additional code in the void Monitor::handle_command(MonOpRequestRef op) function needs to run before bailing out of it. I have looked for such potential conditions and not found any.
[Other Info]
Upstream tracker - https://tracker.ceph.com/issues/57859
Fixed in ceph main through https://github.com/ceph/ceph/pull/48044
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1969000/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list