
RADOS

CES and Ceph KB articles related to RADOS, covering the MON, MGR, and OSDs.


Cluster - Pause Whole Cluster For Maintenance

Problem

An admin wants to shut down the whole cluster (all nodes) for maintenance.

Solution

ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd set pause

Discussion

To bring the cluster back up after maintenance, unset the same flags in reverse order.
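
For reference, a sketch of the corresponding unset sequence, to be run once all nodes are powered back on and the MONs have quorum:

ceph osd unset pause
ceph osd unset nodeep-scrub
ceph osd unset noscrub
ceph osd unset nodown
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset noout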


OSD - PGs Stuck Activating

Problem

A user reports that whenever they change the weight of an OSD (e.g., by marking an OSD in), several PGs get stuck in the activating state and ceph status reports many slow ops warnings.

Solution

Increase the hard limit on the number of PGs per OSD using:

# ceph config set osd osd_max_pg_per_osd_hard_ratio 10

Discussion

Changes to the CRUSH map or OSD weights may temporarily cause the number of PGs mapped to an OSD to exceed the mon_max_pg_per_osd limit. By setting a large value for osd_max_pg_per_osd_hard_ratio, we configure the OSD not to block PGs from activating in this transient case.
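
The effective hard limit is mon_max_pg_per_osd multiplied by osd_max_pg_per_osd_hard_ratio (for example, 250 x 10 = 2500 PGs with recent defaults). To inspect the values a running OSD sees, where osd.0 is just an example daemon id:

ceph config show osd.0 mon_max_pg_per_osd
ceph config show osd.0 osd_max_pg_per_osd_hard_ratio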


OSD - Laggy Placement Groups

Problem

A customer cluster running v16.2.7 reports that whenever they stop an OSD, PGs go into the laggy state and cluster IO stops for several seconds.

Solution

This is a known issue related to the implementation of the osd_fast_shutdown feature in early pacific v16 releases. As a workaround, use:

# ceph config set osd osd_fast_shutdown_notify_mon true

It is then recommended to upgrade to the latest v16 release (v16.2.13 at the time of writing). Note that osd_fast_shutdown_notify_mon = true is now the default in current Ceph releases as of summer 2023.

Discussion

The osd_fast_shutdown feature was added in Pacific as a quicker way to shut down the OSD. In the previous approach, the OSD would call the destructors for all OSD classes and safely close all open files, such as the RocksDB database and object data. With osd_fast_shutdown, the OSD simply aborts its process. The thinking is that the OSD can already cleanly recover from a power loss, so this type of abrupt stop is preferable. The problem is that the mon takes a long time to notice that an OSD has shut down like this, so the osd_fast_shutdown_notify_mon option was added to send a message to the mon, letting it know that the OSD is stopping. This allows the PGs to re-peer quickly and avoid a long IO pause.
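
To confirm what a running OSD is using for these options (osd.0 is just an example daemon id):

ceph config show osd.0 osd_fast_shutdown
ceph config show osd.0 osd_fast_shutdown_notify_mon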


OSD - Repairing Inconsistent PG

Problem

Ceph is warning about inconsistent PGs.

Solution

Users are advised to refer to the upstream documentation Repairing Inconsistent PGs.

If users notice that deep-scrub is discovering inconsistent objects with a regular frequency, and if those errors coincide with SCSI Medium Errors on the underlying drives, it is recommended to switch on automatic repair of damaged objects detected during scrub:

# ceph config set osd osd_scrub_auto_repair true
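
For a one-off repair, a typical sequence is to list the inconsistent PGs, inspect the damaged objects, and then trigger a repair on the primary OSD; the PG id 2.5 below is a placeholder taken from the health output:

ceph health detail
rados list-inconsistent-obj 2.5 --format=json-pretty
ceph pg repair 2.5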

OSD - Improved Procedure for Adding Hosts or OSDs

Problem

When I add many hosts with new capacity to Ceph, way too much data needs to be backfilled and my cluster becomes unstable.

Solution

CLYSO recommends the following improved approach when adding capacity to a Ceph cluster. This procedure makes use of an external tool ("upmap-remapped.py") and the MGR balancer in order to gain more control over the data movement needed to add hosts to an existing cluster.

Before you start, you must first apply our recommended MGR balancer configuration. The balancer is the Ceph MGR module that adjusts PG mappings to achieve a uniform data distribution.

It is best to configure the balancer to leave some idle time per week, so that internal data structures ("osdmaps") can be trimmed regularly. For example, with this config the balancer will pause on Saturdays:

ceph config set mgr mgr/balancer/begin_weekday 0
ceph config set mgr mgr/balancer/end_weekday 5

Alternatively, you may choose to balance PGs only during certain hours each day, for example, to allow the backfilling PGs to complete each night:

ceph config set mgr mgr/balancer/begin_time 0830
ceph config set mgr mgr/balancer/end_time 1800

Next, decrease the max misplaced ratio from its default of 5% to 0.5%, to minimize the IO impact of backfilling and to ensure the tail of backfilling PGs can finish over the weekend or overnight. You may increase this percentage if you find that 0.5% is too small for your cluster.

ceph config set mgr target_max_misplaced_ratio 0.005

Lastly, configure the mgr to balance until you have +/- 1 PG per pool per OSD -- this is the best uniformity we can hope for with the mgr balancer.

ceph config set mgr mgr/balancer/upmap_max_deviation 1
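
The upmap_max_deviation option applies to the balancer's upmap mode. As a sketch, the active mode and the settings applied above can be verified with:

ceph balancer status
ceph config get mgr target_max_misplaced_ratio
ceph config get mgr mgr/balancer/upmap_max_deviation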

Procedure to add hosts with upmap-remapped

With the above balancer configuration in place, you can use this procedure to add hosts gracefully using upmap-remapped.

  1. Set these flags to prevent data from moving immediately when we add new OSDs:
ceph osd set norebalance
ceph balancer off
  2. Add the new OSDs using cephadm or your preferred management tool. Note -- we always recommend having watch ceph -s in a window whenever making any changes to your ceph cluster.

  3. Download ./upmap-remapped.py from here. Run it wherever you run ceph CLI commands, and inspect its output:

./upmap-remapped.py

It should output several lines like ceph osd pg-upmap-items ... (the general form of this command is shown after the procedure). If not, reach out for help.

  4. Now we run upmap-remapped for real, normally twice in order to get the minimal number of misplaced objects:
./upmap-remapped.py | sh -x
./upmap-remapped.py | sh -x

While the above are running, you should see the % misplaced objects decreasing in your ceph -s terminal. Ideally it will go to 0, meaning all PGs are active+clean and the cluster is fully healthy.

  5. Finally, unset the flags so data starts rebalancing again. At this point, the mgr balancer will move data in a controlled manner to your new empty OSDs:
ceph osd unset norebalance
ceph balancer on
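
For reference, the upmap exceptions generated by upmap-remapped.py use the pg-upmap-items command, which replaces one OSD with another in a PG's up set. The PG and OSD ids below are placeholders only; the script computes the real mappings for you:

ceph osd pg-upmap-items 1.2f 57 12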

Discussion

Placement Groups, Upmap, and the Balancer are all complex topics but offer very powerful tools to optimize Ceph operations. CLYSO has presented on this topic regularly -- feel free to reach out if you have any questions: