Follow-up to "OSD(s) with unlimited RAM growth"
You can check whether any OSDs in a Ceph cluster are affected by the "PG Dup Bug" by running the following command:
ceph tell osd.\* perf dump |grep 'osd_pglog\|^osd\.[0-9]'
This lists every OSD in the cluster together with two counters:
- osd_pglog_bytes
- osd_pglog_items
The "osd_pglog_items" counter is the sum of "normal" log entries, dup entries and some other things. Since osd_target_pg_log_entries_per_osd is 300,000 by default, we may assume that about 300,000 items are "normal" PG log entries. If the "osd_pglog_items" counter is much higher than that, the excess is most likely due to dups. Example:
osd.32: { "osd_pglog_bytes": 1925908608, "osd_pglog_items": 17418324 }
17,418,324 (osd_pglog_items) - 300,000 = 17,118,324, so probably about 17 million PG dups
Running a manual check against this OSD with the commands from "OSD(s) with unlimited RAM growth" revealed 1 PG with 17,090,093 entries. So this is a quick and easy way to identify problematic OSD(s) without having to stop all OSDs and run commands manually.
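The subtraction above can be applied to every OSD at once. The following is only a minimal sketch: the function name estimate_pg_dups and the BASELINE and FACTOR variables are hypothetical helpers, not part of Ceph. BASELINE assumes the default of 300,000 for osd_target_pg_log_entries_per_osd, and the cutoff of twice the baseline is an arbitrary choice for "much higher". It also assumes the perf dump output has the shape shown in the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): reads the perf dump output on stdin
# and prints every OSD whose osd_pglog_items is far above the baseline.
# BASELINE: assumed number of "normal" pg log entries per OSD (default 300000).
# FACTOR:   arbitrary cutoff multiplier for "much higher" (assumption).
estimate_pg_dups() {
    awk -v baseline="${BASELINE:-300000}" -v factor="${FACTOR:-2}" '
    {
        # Remember the most recent "osd.N" label seen on a line.
        if (match($0, /osd\.[0-9]+/))
            osd = substr($0, RSTART, RLENGTH)

        # Extract the integer that follows "osd_pglog_items":
        if (match($0, /"osd_pglog_items": *[0-9]+/)) {
            s = substr($0, RSTART, RLENGTH)
            sub(/^[^:]*: */, "", s)
            items = s + 0
            # Report only OSDs that clearly exceed the baseline.
            if (items > baseline * factor)
                printf "%s items=%d estimated_dups=%d\n", osd, items, items - baseline
        }
    }'
}

# Usage (requires a running cluster):
#   ceph tell osd.\* perf dump | estimate_pg_dups
```

If osd_target_pg_log_entries_per_osd has been changed from its default, set BASELINE to the configured value before calling the function. The result is only an estimate, because the counter also includes "some other things" besides normal entries and dups.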
Sources:
github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L2951