[Bug 1978913] Re: [SRU] ceph-osd takes all memory at boot
nikhil kshirsagar
1978913 at bugs.launchpad.net
Wed Nov 2 08:02:33 UTC 2022
** Description changed:
[Impact]
ceph-osd takes all memory at boot
[Test Plan]
- https://tracker.ceph.com/issues/53729
+ To see the problem, follow this approach on a test cluster with, for example, 3 OSDs:
+
+ #ps -eaf | grep osd
+ root 334891 1 0 Sep21 ? 00:42:03 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
+ root 335541 1 0 Sep21 ? 00:40:20 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
+
+ Kill all of the OSDs so that they are down (for example, as sketched below), and confirm with ceph -s:
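+
+ A minimal way to stop the OSD processes in the dev cluster above, using the PIDs reported by ps (adjust to your environment):
+
+ kill 334891 335541        # or simply: pkill -f ceph-osd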
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ceph -s
+ 2022-09-22T08:26:15.120+0000 7fa9694fe700 -1 WARNING: all dangerous and experimental features are enabled.
+ 2022-09-22T08:26:15.140+0000 7fa963fff700 -1 WARNING: all dangerous and experimental features are enabled.
+ cluster:
+ id: 9e7c0a82-8072-4c48-b697-1e6399b4fc9e
+ health: HEALTH_WARN
+ 2 osds down
+ 1 host (3 osds) down
+ 1 root (3 osds) down
+ Reduced data availability: 169 pgs stale
+ Degraded data redundancy: 255/765 objects degraded (33.333%), 64 pgs degraded, 169 pgs undersized
+
+ services:
+ mon: 3 daemons, quorum a,b,c (age 3s)
+ mgr: x(active, since 28h)
+ mds: a:1 {0=a=up:active}
+ osd: 3 osds: 0 up (since 83m), 2 in (since 91m)
+ rgw: 1 daemon active (8000)
+
+ task status:
+
+ data:
+ pools: 7 pools, 169 pgs
+ objects: 255 objects, 9.5 KiB
+ usage: 4.1 GiB used, 198 GiB / 202 GiB avail
+ pgs: 255/765 objects degraded (33.333%)
+ 105 stale+active+undersized
+ 64 stale+active+undersized+degraded
+
+
+ Then create a JSON file describing the dups to inject into every OSD; the "generate" field tells the tool how many dup entries to create, starting from the given version:
+
+ root@nikhil-Lenovo-Legion-Y540-15IRH-PG0:/home/nikhil/HDD_MOUNT/Downloads/ceph_build_oct/ceph/build# cat bin/dups.json
+ [
+ {"reqid": "client.4177.0:0",
+ "version": "3'0",
+ "user_version": "0",
+ "generate": "50000000",
+ "return_code": "0"}
+ ]
+
+ Use ceph-objectstore-tool with the --op pg-log-inject-dups operation to inject the dups into each OSD:
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd0/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd2/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
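+
+ The commands above inject into a single pgid (2.1e). If you want to inject the dups into every PG on an OSD instead, a small loop along these lines can be used (a sketch; --op list-pgs enumerates the PGs held by the OSD):
+
+ for pg in $(./bin/ceph-objectstore-tool --data-path dev/osd0/ --op list-pgs --no-mon-config); do
+     ./bin/ceph-objectstore-tool --data-path dev/osd0/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid "$pg"
+ done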
+
+ Then set the OSD debug level to 20, since the log line that records the actual trimming (https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138) is only emitted with debug_osd = 20.
+
+ Set debug osd = 20 in the [global] section of ceph.conf:
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# cat ceph.conf | grep "debug osd"
+ debug osd=20
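+
+ In other words, ceph.conf should contain something like the following (minimal sketch):
+
+ [global]
+         debug osd = 20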
+
+ Then bring up the OSDs
+
+ /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
+
+ /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 1 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
+
+ /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
+
+ Run some IO on the OSDs. Wait at least a few hours.
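+
+ For example, IO can be generated with rbd bench-write against a test image (the pool/image names below are placeholders, and the rbd binary lives in ./bin/ in this build-tree setup):
+
+ ./bin/rbd create rbd/benchimg --size 10G
+ ./bin/rbd bench-write rbd/benchimg --io-size 4096 --io-threads 16 --io-total 10G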
+
+ Then take the OSDs down (so the command below can be run), and run,
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1e --op log > op.log
+
+ At the end of that output in op.log, you will see that the number of dups is still what it was when they were injected, i.e. no trimming has taken place (a quick way to count them is sketched after the excerpt below):
+
+ {
+ "reqid": "client.4177.0:0",
+ "version": "3'499999",
+ "user_version": "0",
+ "return_code": "0"
+ },
+ {
+ "reqid": "client.4177.0:0", <-- note the id (4177)
+ "version": "3'500000", <---
+ "user_version": "0",
+ "return_code": "0"
+ }
+ ]
+ },
+ "pg_missing_t": {
+ "missing": [],
+ "may_include_deletes": true
+ }
+
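+ A quick way to count the remaining dup entries in op.log (a sketch, assuming jq is installed and that the dups sit under the pg_log_t key of the --op log output, as in the excerpt above):
+
+ jq '.pg_log_t.dups | length' op.log
+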
+ To verify the patch:
+ With the patch in place, once the dups are injected, the output of ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log will again show the dups (as before, run this command with the OSDs down).
+
+
+ Then bring up the OSDs and start IO using rbd bench-write. Leave the IO running for a few hours, until log lines like the ones below (from https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138) appear in the OSD logs with the same client ID (4177 in my example) that was used when injecting the dups:
+
+ root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build/out# cat osd.1.log | grep -i "trim dup " | grep 4177 | more
+
+ 2022-09-26T10:30:53.125+0000 7fdb72741700 1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)
+ ...
+ ...
+ 2022-09-26T10:30:53.125+0000 7fdb72741700 1 trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)
+
+ # grep -ri "trim dup " *.log | grep 4177 | wc -l
+ 390001 <-- total across all OSDs; with 3 OSDs this should be roughly 3x the per-OSD figure seen in the output below (dups trimmed up to 3'130001), since this count combines the "trim dup" log lines from every OSD.
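+
+ A per-OSD breakdown can be obtained with something along these lines (a sketch; it assumes the out/osd.N.log layout shown above and is run from that directory):
+
+ for f in osd.*.log; do printf '%s: ' "$f"; grep "trim dup " "$f" | grep -c 4177; done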
+
+ And the output of ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log (the OSD in question must be taken down to run this) will show that the first batch of dups (around 130k in this example) has already been trimmed; note the "version" field, which now starts at 3'130001 rather than at the beginning of the injected range:
+
+ "dups": [
+ {
+ "reqid": "client.4177.0:0",
+ "version": "3'130001", <----
+ "user_version": "0",
+ "return_code": "0"
+ },
+ {
+ "reqid": "client.4177.0:0",
+ "version": "3'130002",
+ "user_version": "0",
+ "return_code": "0"
+ },
+
+ This verifies that the dups are being trimmed by the patch and that it is working correctly. And, of course, the OSDs should no longer run out of memory at boot time.
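+
+ To confirm that memory stays bounded while the OSDs boot and trim the dups, their resident set size can be watched with something like:
+
+ watch -n 5 'ps -C ceph-osd -o pid,rss,vsz,cmd'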
+
[Where problems could occur]
- Trimming large clusters could be time consuming.
+ This is not a clean cherry-pick due to some differences between the octopus and master codebases related to RocksDBStore and ObjectStore (see https://github.com/ceph/ceph/pull/47046#issuecomment-1243252126).
+
+ Also, an earlier attempt to fix this issue upstream was reverted, as
+ discussed at
+ https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1978913/comments/1
+
+ While this fix has been tested and validated after building it into the
+ upstream 15.2.17 release (see the [Test Plan] section), we still need to
+ proceed with extreme caution: allow some time for any problems to
+ surface before going ahead with this SRU, and run our QA tests on the
+ packages that carry this fix on top of 15.2.17 before releasing them to
+ the customers who are awaiting this fix on octopus.
[Other Info]
The fix makes PGLog trim duplicates by the number of entries rather than by version; that way, unbounded growth of the dup list is prevented.
Reported upstream at https://tracker.ceph.com/issues/53729 and fixed on
master through https://github.com/ceph/ceph/pull/47046
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1978913
Title:
[SRU] ceph-osd takes all memory at boot
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive queens series:
New
Status in Ubuntu Cloud Archive ussuri series:
New
Status in Ubuntu Cloud Archive wallaby series:
New
Status in Ubuntu Cloud Archive xena series:
New
Status in Ubuntu Cloud Archive yoga series:
New
Status in ceph package in Ubuntu:
New
Status in ceph source package in Bionic:
New
Status in ceph source package in Focal:
New
Status in ceph source package in Jammy:
New
Status in ceph source package in Kinetic:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1978913/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list