We have been working with Ceph for more than 10 years and have observed the following behaviour of osds on different Ceph clusters for several years:
- osds start very slowly, up to 10 minutes until ceph osd up was reported
- increased osd memory consumption up to 8GB, with default osd_memory target of 4GB
- individual hosts that wanted to consume so much main memory that the OOM killer of the Linux kernel terminated the process. Main memory consumption of individual osds tested up to 150 GB.
- Complete Ceph cluster with all osds that could no longer be started successfully because all osds wanted to consume the maximum amount of main memory until they were terminated by the Linux OOM killer.
There were also many messages in the Ceph mailing list about similar problems, which were analysed extensively. However, they never tracked down the root cause of the errors. Often the problem simply resolved itself or the affected osds were removed and reinstalled.
Together with affected people and colleagues from the community, we were now able to permanently investigate the bug and found the root cause.
tracker.ceph.com/issues/53729
Root Cause
There are dup entries with a version higher than the log entries. This means that if there is any dup entry with a higher version than the tail of the log, we will not trim anything past it, but we will keep accumulating new dups as we trim pg_log_entries and add them to the back of the dup list. tracker.ceph.com/issues/53729#note-57
Affected Ceph versions: all versions that have dups (jewel or luminous and later).
Possible explanation
Possible explanation why we can observe it more and more often is that the autoscaler was not active by default.
Our experience of the last weeks is that we find osds with several million pg dup entries on many Ceph clusters we maintain. The current peak is osds with over 50 million entries (octopus release).
Mitigation
We have developed some tools to mitigate the problem.
This is ceph-objectstore-tool built with the patches from PR github.com/ceph/ceph/pull/45529 See built packages at shaman.ceph.com/repos/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ E.g. this is for bionic 2.chacra.ceph.com/r/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ubuntu/bionic/flavors/default/
Note: Shaman is used for testing on teuthology and I am not sure how long the packages remain available there.
Mitigation process
-
Identify OSD affected, e.g. long bootup after restart
-
set NOOUT, stop OSD and mount it
ceph osd set noout
systemctl stop ceph-FSID@osd.OSD.service
ceph-volume lvm activate OSD OSD-FSID --no-systemd -
create list of PGs on OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op list-pgs > osd.11.pgs.txt
-
Check for DUPs on all PGs
while read pg; do echo $pg; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11/ --op log --pgid $pg > pglog.json; jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pglog.json; done < /root/osd.11.pgs.txt 2>&1 | tee dups.log
-
run tool on affected PGs. Check Memory Usage - it depends on the parameter "osd_pg_log_trim_max" - we observed around 4G with osd_pg_log_trim_max=500000. we identified osd_pg_log_trim_max=500000 as the optimum value. further increasing osd_pg_log_trim_max will not speed up the process.
time ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op trim-pg-log --pgid PG --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=500000
Recommendation
In the example above, osd_pg_log_trim_max is already very high, increasing it further would not increase the speed. For safety reasons, we recommend starting with a smaller number, e.g. 100000. From the experience of the first run, you can then think about increasing or accelerating the speed.
Notice
Special care should be taken with the following actions:
- Ceph Cluster Upgrade
- pg split or pg merge- manually or automatically via autoscaler
- use of the current patch
The listed actions can unintentionally trigger the trimming of the pg dups, which can lead to osds being inaccessible as long as they perform the trim action and can consume an enormous amount of main memory.
This could have a particularly extreme effect on osd in connection with rook and the kubernetes liveness probe.
What will happen when a user that has a problem like us, i.e. 30 million of dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction? github.com/ceph/ceph/pull/45529