[Bug 1969000] Re: [SRU] mon crashes when improper json is passed to rados

nikhil kshirsagar <1969000@bugs.launchpad.net>
Thu Jun 16 10:46:38 UTC 2022


** Description changed:

  [Impact]
- If improper json data is passed to rados, it can end up crashing the mon.
+ If improper JSON data is passed to rados using a manual curl command, it can end up crashing the mon.
  
  [Test Plan]
- A manual run of curl with the malformed request like this results in the crash -
+ Set up a Ceph Octopus cluster. A manual run of curl with a malformed request like the following results in the crash:
  
  curl -k -H "Authorization: Basic $TOKEN"
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST -d '{"prefix":"auth
  add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd
  '\''allow rw pool=testpool01'\''"}'
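  
  For reference, the same request can also be sent from Python. This is a
  sketch, not part of the original report: it assumes the python-requests
  library, and TOKEN is a placeholder for the base64 "user:key" credential
  used by the mgr restful API.
  
  #!/usr/bin/env python3
  import requests
  
  TOKEN = "REPLACE_WITH_BASE64_USER_COLON_KEY"  # placeholder credential
  resp = requests.post(
      "https://juju-3b3d82-10-lxd-0:8003/request",
      headers={"Authorization": "Basic " + TOKEN},
      # Malformed caps: a single quoted string, as in the curl command above.
      json={"prefix": "auth add",
            "entity": "client.testuser02",
            "caps": "mon 'allow r' osd 'allow rw pool=testpool01'"},
      verify=False,  # equivalent of curl -k
  )
  print(resp.status_code, resp.text)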
  
- The request status shows it is still in the queue.
+ Checking the request status with curl -k -X GET "$endpoint/request"
+ shows it is still queued:
  
  [
      {
          "failed": [],
          "finished": [],
          "has_failed": false,
          "id": "140576245092648",
          "is_finished": false,
          "is_waiting": false,
          "running": [
              {
                  "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
                  "outb": "",
                  "outs": ""
              }
          ],
          "state": "pending",
          "waiting": []
      }
  ]
  
+ The crash also reproduces without the restful API.
+ 
+ Use this Python script to reproduce the issue; run it on the mon node:
+ 
+ root@juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
+ #!/usr/bin/env python3
+ import json
+ import rados
+ c = rados.Rados(conffile='/etc/ceph/ceph.conf')
+ c.connect()
+ # Malformed command: "caps" is a single string with stray quote
+ # characters rather than a list of entity/capability strings.
+ cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
+ # mon_command(cmd, inbuf) returns (ret, outbuf, outs); with this input it hangs.
+ print(c.mon_command(cmd, b''))
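+ 
+ For contrast, here is a minimal sketch (not part of the original report)
+ of the same call with well-formed caps, passed as a list of alternating
+ entity/capability strings (the form the ceph CLI itself sends); the mon
+ should accept this without crashing:
+ 
+ #!/usr/bin/env python3
+ import json
+ import rados
+ 
+ c = rados.Rados(conffile='/etc/ceph/ceph.conf')
+ c.connect()
+ # Well-formed: "caps" is a list, not a single string.
+ cmd = json.dumps({"prefix": "auth add",
+                   "entity": "client.testuser02",
+                   "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"]})
+ # mon_command(cmd, inbuf) -> (ret, outbuf, outs)
+ print(c.mon_command(cmd, b''))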
+ 
+ root@juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
+   cluster:
+     id:     6123c916-a12a-11ec-bc02-fa163e9f86e0
+     health: HEALTH_WARN
+             mon is allowing insecure global_id reclaim
+             1 monitors have not enabled msgr2
+             Reduced data availability: 69 pgs inactive
+             1921 daemons have recently crashed
+ 
+   services:
+     mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
+     mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
+     osd: 3 osds: 3 up (since 22h), 3 in
+ 
+   data:
+     pools:   4 pools, 69 pgs
+     objects: 0 objects, 0 B
+     usage:   0 B used, 0 B / 0 B avail
+     pgs:     100.000% pgs unknown
+              69 unknown
+ 
+ root@juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
+ 
+ ^C
+ 
+ (note the script hangs)
+ 
+ The mon logs (https://pastebin.com/Cuu9jkmu) show the crash; systemd then
+ restarts the mon, so ceph -s hangs for a while before the restart
+ messages appear:
+ 
+ --- end dump of recent events ---
+ 2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
+ 2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
+ 2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
+ 2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
+ 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
+ 
  [Where problems could occur]
- No problems foreseen because the exception is hit only in case of malformed json data, and not otherwise, and it is a desirable thing to catch and handle it instead of allowing process termination due to uncaught exception.
+ If malformed input data, like the JSON in the Python script, triggers
+ exceptions in other parts of the code that are not caught by this fix,
+ a termination or crash could still occur in those situations.
  
  [Other Info]
  Reported upstream at https://tracker.ceph.com/issues/54558 (including the reproducer and fix testing details) and fixed through https://github.com/ceph/ceph/pull/45547
  
  PR for Octopus is at https://github.com/ceph/ceph/pull/45891

** Description changed:

  [Impact]
- If improper JSON data is passed to rados using a manual curl command, it can end up crashing the mon.
+ If improper JSON data is passed to rados using a manual curl command, or invalid JSON is sent through a script like the Python example shown below, it can end up crashing the mon.
  
  [Test Plan]
  Set up a Ceph Octopus cluster. A manual run of curl with a malformed request like the following results in the crash:
  
  curl -k -H "Authorization: Basic $TOKEN"
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST -d '{"prefix":"auth
  add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd
  '\''allow rw pool=testpool01'\''"}'
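  
  For reference, the same request can also be sent from Python. This is a
  sketch, not part of the original report: it assumes the python-requests
  library, and TOKEN is a placeholder for the base64 "user:key" credential
  used by the mgr restful API.
  
  #!/usr/bin/env python3
  import requests
  
  TOKEN = "REPLACE_WITH_BASE64_USER_COLON_KEY"  # placeholder credential
  resp = requests.post(
      "https://juju-3b3d82-10-lxd-0:8003/request",
      headers={"Authorization": "Basic " + TOKEN},
      # Malformed caps: a single quoted string, as in the curl command above.
      json={"prefix": "auth add",
            "entity": "client.testuser02",
            "caps": "mon 'allow r' osd 'allow rw pool=testpool01'"},
      verify=False,  # equivalent of curl -k
  )
  print(resp.status_code, resp.text)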
  
  Checking the request status with curl -k -X GET "$endpoint/request"
  shows it is still queued:
  
  [
      {
          "failed": [],
          "finished": [],
          "has_failed": false,
          "id": "140576245092648",
          "is_finished": false,
          "is_waiting": false,
          "running": [
              {
                  "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
                  "outb": "",
                  "outs": ""
              }
          ],
          "state": "pending",
          "waiting": []
      }
  ]
  
  The crash also reproduces without the restful API.
  
  Use this Python script to reproduce the issue; run it on the mon node:
  
  root@juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
  #!/usr/bin/env python3
  import json
  import rados
  c = rados.Rados(conffile='/etc/ceph/ceph.conf')
  c.connect()
  # Malformed command: "caps" is a single string with stray quote
  # characters rather than a list of entity/capability strings.
  cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
  # mon_command(cmd, inbuf) returns (ret, outbuf, outs); with this input it hangs.
  print(c.mon_command(cmd, b''))
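  
  For contrast, here is a minimal sketch (not part of the original report)
  of the same call with well-formed caps, passed as a list of alternating
  entity/capability strings (the form the ceph CLI itself sends); the mon
  should accept this without crashing:
  
  #!/usr/bin/env python3
  import json
  import rados
  
  c = rados.Rados(conffile='/etc/ceph/ceph.conf')
  c.connect()
  # Well-formed: "caps" is a list, not a single string.
  cmd = json.dumps({"prefix": "auth add",
                    "entity": "client.testuser02",
                    "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"]})
  # mon_command(cmd, inbuf) -> (ret, outbuf, outs)
  print(c.mon_command(cmd, b''))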
  
  root@juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
    cluster:
      id:     6123c916-a12a-11ec-bc02-fa163e9f86e0
      health: HEALTH_WARN
              mon is allowing insecure global_id reclaim
              1 monitors have not enabled msgr2
              Reduced data availability: 69 pgs inactive
              1921 daemons have recently crashed
  
    services:
      mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
      mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
      osd: 3 osds: 3 up (since 22h), 3 in
  
    data:
      pools:   4 pools, 69 pgs
      objects: 0 objects, 0 B
      usage:   0 B used, 0 B / 0 B avail
      pgs:     100.000% pgs unknown
               69 unknown
  
  root@juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
  
  ^C
  
  (note the script hangs)
  
  The mon logs (https://pastebin.com/Cuu9jkmu) show the crash; systemd then
  restarts the mon, so ceph -s hangs for a while before the restart
  messages appear:
  
  --- end dump of recent events ---
  2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
  2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
  2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
  2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
  
  [Where problems could occur]
  If malformed input data, like the JSON in the Python script, triggers
  exceptions in other parts of the code that are not caught by this fix,
  a termination or crash could still occur in those situations.
  
  [Other Info]
  Reported upstream at https://tracker.ceph.com/issues/54558 (including the reproducer and fix testing details) and fixed through https://github.com/ceph/ceph/pull/45547
  
  PR for Octopus is at https://github.com/ceph/ceph/pull/45891

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1969000

Title:
  [SRU] mon crashes when improper json is passed to rados

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in ceph package in Ubuntu:
  New
Status in ceph source package in Focal:
  New

Bug description:
  [Impact]
  If improper JSON data is passed to rados using a manual curl command, or invalid JSON is sent through a script like the Python example shown below, it can end up crashing the mon.

  [Test Plan]
  Set up a Ceph Octopus cluster. A manual run of curl with a malformed request like the following results in the crash:

  curl -k -H "Authorization: Basic $TOKEN"
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST -d
  '{"prefix":"auth add","entity":"client.testuser02","caps":"mon
  '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}'
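  
  For reference, the same request can also be sent from Python. This is a
  sketch, not part of the original report: it assumes the python-requests
  library, and TOKEN is a placeholder for the base64 "user:key" credential
  used by the mgr restful API.
  
  #!/usr/bin/env python3
  import requests
  
  TOKEN = "REPLACE_WITH_BASE64_USER_COLON_KEY"  # placeholder credential
  resp = requests.post(
      "https://juju-3b3d82-10-lxd-0:8003/request",
      headers={"Authorization": "Basic " + TOKEN},
      # Malformed caps: a single quoted string, as in the curl command above.
      json={"prefix": "auth add",
            "entity": "client.testuser02",
            "caps": "mon 'allow r' osd 'allow rw pool=testpool01'"},
      verify=False,  # equivalent of curl -k
  )
  print(resp.status_code, resp.text)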

  Checking the request status with curl -k -X GET "$endpoint/request"
  shows it is still queued:

  [
      {
          "failed": [],
          "finished": [],
          "has_failed": false,
          "id": "140576245092648",
          "is_finished": false,
          "is_waiting": false,
          "running": [
              {
                  "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
                  "outb": "",
                  "outs": ""
              }
          ],
          "state": "pending",
          "waiting": []
      }
  ]

  The crash also reproduces without the restful API.

  Use this Python script to reproduce the issue; run it on the mon node:

  root@juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
  #!/usr/bin/env python3
  import json
  import rados
  c = rados.Rados(conffile='/etc/ceph/ceph.conf')
  c.connect()
  # Malformed command: "caps" is a single string with stray quote
  # characters rather than a list of entity/capability strings.
  cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
  # mon_command(cmd, inbuf) returns (ret, outbuf, outs); with this input it hangs.
  print(c.mon_command(cmd, b''))
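  
  For contrast, here is a minimal sketch (not part of the original report)
  of the same call with well-formed caps, passed as a list of alternating
  entity/capability strings (the form the ceph CLI itself sends); the mon
  should accept this without crashing:
  
  #!/usr/bin/env python3
  import json
  import rados
  
  c = rados.Rados(conffile='/etc/ceph/ceph.conf')
  c.connect()
  # Well-formed: "caps" is a list, not a single string.
  cmd = json.dumps({"prefix": "auth add",
                    "entity": "client.testuser02",
                    "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"]})
  # mon_command(cmd, inbuf) -> (ret, outbuf, outs)
  print(c.mon_command(cmd, b''))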

  root@juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
    cluster:
      id:     6123c916-a12a-11ec-bc02-fa163e9f86e0
      health: HEALTH_WARN
              mon is allowing insecure global_id reclaim
              1 monitors have not enabled msgr2
              Reduced data availability: 69 pgs inactive
              1921 daemons have recently crashed
  
    services:
      mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
      mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
      osd: 3 osds: 3 up (since 22h), 3 in
  
    data:
      pools:   4 pools, 69 pgs
      objects: 0 objects, 0 B
      usage:   0 B used, 0 B / 0 B avail
      pgs:     100.000% pgs unknown
               69 unknown

  root@juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py

  ^C

  (note the script hangs)

  The mon logs (https://pastebin.com/Cuu9jkmu) show the crash; systemd then
  restarts the mon, so ceph -s hangs for a while before the restart
  messages appear:

  --- end dump of recent events ---
  2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
  2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
  2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
  2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
  2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2

  [Where problems could occur]
  If malformed input data, like the JSON in the Python script, triggers
  exceptions in other parts of the code that are not caught by this fix,
  a termination or crash could still occur in those situations.

  [Other Info]
  Reported upstream at https://tracker.ceph.com/issues/54558 (including the reproducer and fix testing details) and fixed through https://github.com/ceph/ceph/pull/45547

  PR for Octopus is at https://github.com/ceph/ceph/pull/45891

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1969000/+subscriptions



