[Bug 1978913] Re: [SRU] ceph-osd takes all memory at boot
Mauricio Faria de Oliveira
1978913 at bugs.launchpad.net
Wed May 10 13:05:52 UTC 2023
Reviewed and uploaded to Focal on top of the recent security upload.
The previous upload has been rejected from the queue (thanks, Robie).
For documentation purposes:
- Original upload (15.2.17-0ubuntu0.20.04.2) was accepted into -proposed
(see [1] and comment #11), but was deleted later due to timing/issues
with the verification tags and test steps (comments #15, #20).
- Another upload (15.2.17-0ubuntu0.20.04.3) was made, but was trumped by
the security upload (same version number); it also had some differences
in the debdiff (e.g., DEP3 headers, changelog) to the original upload.
It has now been rejected from the queue.
- This upload (15.2.17-0ubuntu0.20.04.4) is on top of the security upload
(<version>.3), and has the _original_ changes (<version>.2), which have
already been approved once (this should help), plus adjustments to DEP3.
This has been build-tested locally on amd64, for time/performance
reasons.
Attaching the debdiff (without the build-generated noise in src/test dir)
for reference purposes.
Thanks!
[1] https://launchpad.net/ubuntu/+source/ceph/15.2.17-0ubuntu0.20.04.2
** Changed in: ceph (Ubuntu Focal)
Status: Confirmed => In Progress
** Changed in: ceph (Ubuntu Focal)
Assignee: (unassigned) => nikhil kshirsagar (nkshirsagar)
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1978913
Title:
[SRU] ceph-osd takes all memory at boot
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive queens series:
New
Status in Ubuntu Cloud Archive ussuri series:
Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
Invalid
Status in Ubuntu Cloud Archive xena series:
Invalid
Status in Ubuntu Cloud Archive yoga series:
Invalid
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Bionic:
New
Status in ceph source package in Focal:
In Progress
Status in ceph source package in Jammy:
Invalid
Status in ceph source package in Kinetic:
Invalid
Bug description:
[Impact]
The OSD fails to trim the PG log dup entries, which can leave a PG with
millions of dup entries when there should be at most 3000 (controlled by
the option osd_pg_log_dups_tracked).
This could cause the OSD to run out of memory and crash, and it might
not be able to start up again because it has to load millions of dup
entries. This could happen to multiple OSDs at the same time (as also
reported by many community users), so hitting this issue could leave
the cluster completely unusable.
The currently known trigger for this problem is a PG split, because all of the dup entries are copied during the split. We did not observe this as often before because the PG autoscaler was not turned on by default; it has been on by default since octopus.
Note that there is also no way to check the number of dups in a PG
online.
[Test Plan]
To reproduce the problem, use a test cluster with, e.g., 3 OSDs:
# ps -eaf | grep osd
root 334891 1 0 Sep21 ? 00:42:03 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
root 335541 1 0 Sep21 ? 00:40:20 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
Kill all OSDs so that they are down:
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ceph -s
2022-09-22T08:26:15.120+0000 7fa9694fe700 -1 WARNING: all dangerous and experimental features are enabled.
2022-09-22T08:26:15.140+0000 7fa963fff700 -1 WARNING: all dangerous and experimental features are enabled.
cluster:
id: 9e7c0a82-8072-4c48-b697-1e6399b4fc9e
health: HEALTH_WARN
2 osds down
1 host (3 osds) down
1 root (3 osds) down
Reduced data availability: 169 pgs stale
Degraded data redundancy: 255/765 objects degraded (33.333%), 64 pgs degraded, 169 pgs undersized
services:
mon: 3 daemons, quorum a,b,c (age 3s)
mgr: x(active, since 28h)
mds: a:1 {0=a=up:active}
osd: 3 osds: 0 up (since 83m), 2 in (since 91m)
rgw: 1 daemon active (8000)
task status:
data:
pools: 7 pools, 169 pgs
objects: 255 objects, 9.5 KiB
usage: 4.1 GiB used, 198 GiB / 202 GiB avail
pgs: 255/765 objects degraded (33.333%)
105 stale+active+undersized
64 stale+active+undersized+degraded
Then inject dups using this json for all OSDs,
root@nikhil-Lenovo-Legion-Y540-15IRH-PG0:/home/nikhil/HDD_MOUNT/Downloads/ceph_build_oct/ceph/build# cat bin/dups.json
[
{"reqid": "client.4177.0:0",
"version": "3'0",
"user_version": "0",
"generate": "500000",
"return_code": "0"}
]
Use ceph-objectstore-tool with the pg-log-inject-dups operation
(--op pg-log-inject-dups) to inject dups for all OSDs.
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd0/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd2/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
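The three invocations above differ only in the OSD data path, so they can be generated rather than typed. The following is a minimal sketch, assuming the same working directory and relative paths as in the commands above; the helper name inject_dups_commands is made up for illustration, and the flags are copied from the commands in this test plan, not taken from any other documentation.

```python
def inject_dups_commands(osd_ids, pgid, dups_file="bin/dups.json"):
    """Build one ceph-objectstore-tool argv list per OSD (hypothetical helper)."""
    return [
        [
            "./bin/ceph-objectstore-tool",
            "--data-path", f"dev/osd{osd_id}/",
            "--op", "pg-log-inject-dups",
            "--file", dups_file,
            "--no-mon-config",
            "--pgid", pgid,
        ]
        for osd_id in osd_ids
    ]


# Print the commands for review; nothing is executed here.
commands = inject_dups_commands([0, 1, 2], "2.1e")
for argv in commands:
    print(" ".join(argv))
```

Each argv list could then be passed to subprocess.run(argv, check=True) from the build directory, with the OSDs stopped as described above.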
Then set the OSD debug level to 20. The log line that reports the
actual trim is here:
https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138,
so debug_osd = 20 is needed.
Set debug osd=20 in the global section of ceph.conf:
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# cat ceph.conf | grep "debug osd"
debug osd=20
Then bring up the OSDs
/home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
/home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 1 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
/home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
Run some IO on the OSDs. Wait at least a few hours.
Then take the OSDs down (so the command below can be run), and run,
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1e --op log > op.log
At the end of that output in the file op.log, you will see that the
number of dups is unchanged from when they were injected (no trimming
has taken place):
{
"reqid": "client.4177.0:0",
"version": "3'499999",
"user_version": "0",
"return_code": "0"
},
{
"reqid": "client.4177.0:0", <-- note the id (4177)
"version": "3'500000", <---
"user_version": "0",
"return_code": "0"
}
]
},
"pg_missing_t": {
"missing": [],
"may_include_deletes": true
}
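Since the dups cannot be counted online, the dump is the place to check them. The following is a minimal sketch of counting the entries in such a dump, assuming the file is a single JSON document that contains a "dups" list somewhere in it. The enclosing key name "pg_log_t" in the sample is an assumption; only "dups" and "pg_missing_t" appear in the excerpt above, so the helper searches for "dups" at any depth.

```python
import json


def find_dups(node):
    """Return the first list stored under a "dups" key anywhere in the JSON tree."""
    if isinstance(node, dict):
        if isinstance(node.get("dups"), list):
            return node["dups"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_dups(child)
        if found is not None:
            return found
    return None


def summarize_dups(text):
    """Report how many dups a dump holds, plus the first and last versions."""
    dups = find_dups(json.loads(text)) or []
    versions = [entry["version"] for entry in dups]
    return {
        "count": len(dups),
        "first": versions[0] if versions else None,
        "last": versions[-1] if versions else None,
    }


# Tiny stand-in for op.log; "pg_log_t" is an assumed key name.
sample = """
{"pg_log_t": {"log": [], "dups": [
  {"reqid": "client.4177.0:0", "version": "3'499999", "user_version": "0", "return_code": "0"},
  {"reqid": "client.4177.0:0", "version": "3'500000", "user_version": "0", "return_code": "0"}]},
 "pg_missing_t": {"missing": [], "may_include_deletes": true}}
"""
print(summarize_dups(sample))
```

For a real dump, read the file contents and pass them to summarize_dups(); a count far above osd_pg_log_dups_tracked indicates that trimming has not happened.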
To verify the patch:
With the patch in place, once the dups are injected, output of ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log will again show the dups (this command should be run with the OSDs down, like before).
Then bring up the OSDs and start IO using rbd bench-write. Leave the
IO running for a few hours, until log lines like the ones below
(https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138)
appear in the OSD logs with the same client ID (4177 in my example)
that was used when injecting the dups:
root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build/out# cat osd.1.log | grep -i "trim dup " | grep 4177 | more
2022-09-26T10:30:53.125+0000 7fdb72741700 1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)
...
...
2022-09-26T10:30:53.125+0000 7fdb72741700 1 trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)
# grep -ri "trim dup " *.log | grep 4177 | wc -l
390001 <-- total across all OSDs combined; with 3 OSDs, for example, this should be roughly 3x the per-OSD figure seen in the output below (dups trimmed up to 130001).
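The grep pipeline above gives only a combined total. The following is a minimal sketch of getting a count and the highest trimmed version from a set of log lines, assuming the "trim dup log_dup(...)" line format shown in the excerpt above; the function name count_trimmed is made up for illustration, and unlike grep -i the match here is case-sensitive.

```python
import re

# Matches e.g.: trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)
TRIM_RE = re.compile(r"trim dup log_dup\(reqid=client\.(\d+)\.\d+:\d+ v=(\d+)'(\d+)")


def count_trimmed(lines, client_id):
    """Count 'trim dup' lines for one client ID and track the highest version seen."""
    count = 0
    highest = None
    for line in lines:
        match = TRIM_RE.search(line)
        if match and int(match.group(1)) == client_id:
            count += 1
            version = int(match.group(3))
            highest = version if highest is None else max(highest, version)
    return count, highest


# Two lines copied from the log excerpt above, plus one unrelated line.
sample_lines = [
    "2022-09-26T10:30:53.125+0000 7fdb72741700 1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)",
    "2022-09-26T10:30:53.125+0000 7fdb72741700 1 trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)",
    "2022-09-26T10:30:53.125+0000 7fdb72741700 1 some other log line",
]
print(count_trimmed(sample_lines, 4177))
```

Running this once per OSD log file (passing the open file object as lines) gives per-OSD counts that can be compared with the first remaining "version" in that OSD's dump.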
The output of ./bin/ceph-objectstore-tool --data-path dev/osd1/
--no-mon-config --pgid 2.1f --op log (you need to take that
particular OSD down to run this) will show that the first batch of
dups (130k in this example) has already been trimmed: the "version"
now starts at 3'130001 instead of 0,
"dups": [
{
"reqid": "client.4177.0:0",
"version": "3'130001", <----
"user_version": "0",
"return_code": "0"
},
{
"reqid": "client.4177.0:0",
"version": "3'130002",
"user_version": "0",
"return_code": "0"
},
This will verify that the dups are being trimmed by the patch, and it
is working correctly. And of course, OSDs should not go OOM at boot
time!
[Where problems could occur]
This is not a clean cherry-pick due to some differences between the octopus and master codebases related to RocksDBStore and ObjectStore (see https://github.com/ceph/ceph/pull/47046#issuecomment-1243252126).
Also, an earlier attempt to fix this issue upstream was reverted, as
discussed at
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1978913/comments/1
While this fix has been tested and validated after building it into
the upstream 15.2.17 release (please see the [Test Plan] section), we
still need to proceed with extreme caution: allow some time for
problems (if any) to surface before going ahead with this SRU, and run
our QA tests on the packages that build this fix into the 15.2.17
release before releasing it to the customers who are awaiting this fix
on octopus.
[Other Info]
The fix makes the PGLog trim duplicates by the number of entries rather than by version, which prevents unbounded duplicate growth.
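The difference between the two trimming rules can be illustrated with a toy model. This is a simplified sketch, not the Ceph implementation: dups are modelled as plain integer versions in a deque, the version cutoff of 0 is an arbitrary value chosen to show a cutoff that never reaches the entries, and only the limit of 3000 (osd_pg_log_dups_tracked) and the 500000 injected entries come from this report.

```python
from collections import deque


def trim_by_version(dups, earliest_version):
    """Drop entries from the front only while they are older than the cutoff."""
    while dups and dups[0] < earliest_version:
        dups.popleft()


def trim_by_count(dups, max_tracked):
    """Drop entries from the front until at most max_tracked remain."""
    while len(dups) > max_tracked:
        dups.popleft()


injected = deque(range(1, 500_001))  # 500000 dup versions, as in the test plan

by_version = deque(injected)
trim_by_version(by_version, earliest_version=0)  # hypothetical cutoff below every entry

by_count = deque(injected)
trim_by_count(by_count, max_tracked=3000)

print(len(by_version), len(by_count), by_count[0])
```

With a count-based limit the list is bounded regardless of which versions the entries carry, whereas the version-based rule in this model removes nothing when the cutoff lies below every entry.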
Reported upstream at https://tracker.ceph.com/issues/53729 and fixed
on master through https://github.com/ceph/ceph/pull/47046
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1978913/+subscriptions