Slow Scrub
Problem
In a Ceph 17.2.7 cluster, scrub and deep-scrub warnings had been appearing for around a month:
health: HEALTH_WARN
7 pgs not deep-scrubbed in time
55 pgs not scrubbed in time
The logfiles showed many scrub start and deep-scrub start
messages, but no corresponding scrub ok messages to indicate that the scrubs
had ever completed.
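To see exactly which placement groups are behind, ask the cluster for health detail; it lists the affected PGs along with the timestamp of their last (deep) scrub. A small diagnostic sketch (the exact wording of the detail lines can vary slightly between releases):
# Show the specific PGs behind the HEALTH_WARN messages
ceph health detail | grep -E 'not (deep-)?scrubbed since'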
For each pool in the cluster, scrub_min_interval was set to 172800 seconds (2 days) and
deep_scrub_interval was set to 1209600 seconds (14 days), using the following commands:
ceph osd pool set $pool scrub_min_interval 172800
ceph osd pool set $pool deep_scrub_interval 1209600
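To confirm that the per-pool overrides took effect, read them back. A verification sketch (the loop over ceph osd pool ls is illustrative; ceph osd pool get reports an error for options that were never overridden on a pool):
# Read back the scrub intervals for every pool (values are in seconds)
for pool in $(ceph osd pool ls); do
    echo "== $pool"
    ceph osd pool get "$pool" scrub_min_interval
    ceph osd pool get "$pool" deep_scrub_interval
done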
Solution
Ensure that placement groups can find a "scrub slot" (see the "Discussion" section below for more on scrub slots). Examine the pools in order to find inefficiencies in your cluster. If you have a very high number of objects per placement group, review your strategy and consider changing from an erasure-coding strategy to a replication strategy.
- Set osd_max_scrubs: ceph config set osd osd_max_scrubs 3 (or 5, if you want to scrub more aggressively).
- Set mon_warn_pg_not_scrubbed_ratio: ceph config set global mon_warn_pg_not_scrubbed_ratio 10. Increase this to 10 only after you are convinced that the scrubs take a long time for a good reason.
- Set mon_warn_pg_not_deep_scrubbed_ratio: ceph config set global mon_warn_pg_not_deep_scrubbed_ratio 10 (same as above).
- Review your pool stored/used values in order to find inefficiencies.
- Review ceph pg dump | grep active | sort -k2 -n to sort PGs by the number of objects per PG. If several pools in your cluster have more than 1M objects per PG, review the PG counts and the EC/replication strategy (see the sketch after this list).
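The following sketch approximates the average objects per PG for each pool from the pg dump output. It assumes that the second column of the pgs section is the object count (as in Quincy) and that the pool ID is the part of the PG ID before the dot; check the header line on your release first:
# Average objects per PG, grouped by pool ID (highest first)
ceph pg dump pgs 2>/dev/null | grep active | awk '
  { split($1, a, "."); objects[a[1]] += $2; pgs[a[1]]++ }
  END { for (p in pgs) printf "%d avg objects/PG  pool %s (%d PGs)\n", objects[p]/pgs[p], p, pgs[p] }
' | sort -rn
Use ceph osd lspools to map the numeric pool IDs back to pool names.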
Discussion
When placement groups constantly attempt a deep scrub but never carry it out,
it indicates that they are failing to acquire a
"scrub slot", which is controlled by the osd_max_scrubs option (the default
value of this option is 1). Set this value to 3 by running the ceph config set osd osd_max_scrubs 3 command. (From Ceph version 17.2.8 onward,
osd_max_scrubs defaults to 3.)
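After changing the value, you can confirm both what is stored centrally and what a running OSD is actually using. A small sketch (osd.0 is just an example daemon; a local ceph.conf override on the OSD host takes precedence over the central value):
# Value stored in the monitors' central configuration database
ceph config get osd osd_max_scrubs
# Value the running daemon is actually using
ceph config show osd.0 osd_max_scrubs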
How scrubbing works:
- An OSD looks at its list of PGs for which it is primary. If any are due for scrubbing or deep scrubbing, then the OSD attempts to scrub.
- That OSD asks all the other OSDs involved in that PG whether they have an available scrub slot (for example, 2 others for a 3x replicated PG, or 11 others for a size 12 erasure-coded placement group).
- If all the participating OSDs have an available scrub slot, the scrub starts. If not, the OSD tries again later; the sketch below shows how to check which PGs currently hold scrub slots.
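To see which PGs currently hold scrub slots, and on which OSDs, filter the brief PG dump for scrubbing states (a sketch; pgs_brief includes the acting set, but the exact column layout can vary between releases):
# PGs that are scrubbing or deep scrubbing right now, with their acting OSDs
ceph pg dump pgs_brief 2>/dev/null | awk 'NR==1 || /scrubbing/'
# A quick count of in-flight scrubs
ceph pg dump pgs_brief 2>/dev/null | grep -c scrubbing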
So you can see that with wide EC pools (for example, size 12) and
osd_max_scrubs = 1, it is very difficult to find a moment when none of the 12 OSDs
is scrubbing any other PG. For this reason, your PGs don't get scrubbed
on time. This is compounded by the high object count: scrubbing an individual PG
also takes hours, if it manages to start at all.
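As a rough back-of-the-envelope illustration (the 50% figure is an assumption for the sake of argument, not a measurement, and it treats the OSDs as independent): if each OSD in a 12-wide EC PG is busy with some other scrub half of the time, the chance that all twelve are free at the same moment is
$(1 - 0.5)^{12} = \frac{1}{4096} \approx 0.02\%$
In other words, even moderate scrub activity on each OSD makes a simultaneously free slot on all twelve a rare event, which matches the symptom of scrubs being scheduled but rarely started.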
Under some circumstances, a placement group can tie up its OSDs, making them unable to participate in any other scrubs. In one case, a placement group had been scrubbing for 16015 seconds (about 4.5 hours): it held 516689 objects and 722GB of data, which took a long time to read, checksum with a CRC (cyclic redundancy check), and compare across twelve OSDs.
In another example, a placement group had been scrubbing for 53305s (15 hours). It contained 6962047 objects and 248GB of data. That example demonstrates that the object count impacts the scrub time more than the volume of data in bytes.
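To spot PGs like these in your own cluster, list the PGs that are currently scrubbing together with their object counts and sizes. A sketch that keeps the header row so you can confirm which columns are OBJECTS and BYTES on your release:
# Currently (deep) scrubbing PGs with their object counts and sizes
ceph pg ls | awk 'NR==1 || /scrubbing/'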
Side comment: having 7 million objects in a single PG is not advisable, and the PG autoscaler may not be smart enough to detect this type of imbalance.
In one CephFS pool, the average object size was 9.4TB / 223M objects = roughly 42 kilobytes. Objects this small are not stored efficiently. In this case the pool was a size 12 erasure-coded pool, so the on-disk storage was highly inefficient and the performance correspondingly poor (28TB of raw capacity was used to store 9.4TB of data!).
In cases like this, it is recommended that the data be moved to a 3x replicated pool with an appropriate number of PGs.
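To find pools with a similarly small average object size, divide stored bytes by object count per pool. A sketch assuming jq is available and that the JSON from ceph df exposes stats.stored and stats.objects per pool (field names may differ slightly between releases):
# Columns: pool, objects, stored bytes, average object size in bytes (smallest last)
ceph df -f json | jq -r '
  .pools[]
  | select(.stats.objects > 0)
  | [.name, .stats.objects, .stats.stored, (.stats.stored / .stats.objects | floor)]
  | @tsv
' | sort -t$'\t' -k4 -rn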