
Ceph Copilot

Installation

The CES image includes the Copilot CLI tool, which is installed by default. Installing through Cephadm is the easiest way to get started with Copilot.

Enter the cephadm shell with cephadm shell; from there you should be able to run ceph-copilot --help and see the help menu.
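For example, on a host where cephadm is available (exact prompts may differ depending on your deployment):

$ cephadm shell
$ ceph-copilot --help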

The Clyso Ceph Copilot Assistant

Ceph Copilot is a CLI assistant designed to help administrators manage their Ceph clusters more efficiently. The tool provides a variety of features to help validate cluster health, simplify complex maintenance tasks, and optimize configurations for improved performance and stability.

What does Copilot do?

  • Cluster Validation: Ceph Copilot checks the health of your Ceph cluster and validates its configuration to ensure optimal performance and reliability.

  • Advanced Monitoring and Advising: Future versions of Ceph Copilot will include agents that monitor OSDs, MDSs, RGWs, and other cluster daemons, providing real-time insights and advice for improved configurations.

Usage

Help Command

$ ceph-copilot --help 
usage: copilot [command]

Ceph Copilot: Your Expert Ceph Assistant.

optional arguments:
  -h, --help         show this help message and exit
  --version, -v, -V  show program's version number and exit

subcommands:
  valid subcommands

  {help,cluster,pools,toolkit}
    help             Show this help message and exit
    cluster          List of commands related to the cluster
    pools            Operations and management of Ceph pools
    toolkit          A selection of useful Ceph Tools

If you encounter any bugs, please report them at https://ticket.clyso.com/

Cluster Command

Checkup Command

The checkup command performs an overall health and safety check: it examines the health of your Ceph cluster and validates its configuration to ensure optimal performance and reliability.

$ ceph-copilot cluster checkup
Running tests: ...!.!.X.!.!..............X...X!..!

Overall score: 29 out of 35 (B-)

- WARN in Version/Check for Known Issues in Running Version: Info: Found 1 low severity issue(s) in running version 17.2.7-1
- WARN in Operating System/OS Support: Operating System is Unknown
- FAIL in Pools/Recommended Flags: Some pools have missing flags
- WARN in Pools/Pool Autoscale Mode: pg_autoscaler is on which may cause unexpected data movement
- WARN in Pools/Pool CRUSH Failure Domain Buckets: Not enough crush failure domain buckets for some pools
- FAIL in OSD Health/Check BlueFS DB/Journal is on Flash: All OSDs have bluefs db/wal or journal on rotational device
- FAIL in OSD Health/OSD host memory: All OSD hosts have insufficient memory
- WARN in OSD Health/OSD host swap: Some OSD hosts have swap enabled
- WARN in OSD Health/Dedicated Cluster Network: Public and Cluster Networks are Shared

Use --verbose for details and recommendations
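As the output suggests, re-running the checkup with --verbose prints the details and recommendations behind each finding:

$ ceph-copilot cluster checkup --verbose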

Toolkit Command

The toolkit command provides a selection of useful Ceph tools.

$ ceph-copilot toolkit --help
usage: copilot toolkit [-h] {list,run} ...

positional arguments:
  {list,run}
    list      List the included Ceph tools
    run       Run an included Ceph tool

optional arguments:
  -h, --help  show this help message and exit

Example of the toolkit list command:

$ ceph-copilot toolkit list 
Ceph Tools are installed to /usr/libexec/ceph-copilot/tools

Tools:

clyso-cephfs-recover-metadata
clyso-rgw-find-missing
clyso-ceph-diagnostics-collect
clyso-rados-bulk
contrib/jj_ceph_balancer
cern/upmap-remapped.py

Toolkit run example: contrib/jj_ceph_balancer

The contrib/jj_ceph_balancer tool is a Ceph balancer optimized for equal OSD storage utilization and PG placements across all pools. This can be run with the following command:

$ ceph-copilot toolkit run contrib/jj_ceph_balancer -h
Running tool: contrib/jj_ceph_balancer -h
usage: jj-ceph-balancer [-h] [-v] [-q] [--osdsize {device,weighted,crush}]
                        {gather,show,showremapped,balance,poolosddiff,repairstats,test,osdmap}
                        ...

Ceph balancer optimized for equal OSD storage utilization and PG placements across all pools.

positional arguments:
  {gather,show,showremapped,balance,poolosddiff,repairstats,test,osdmap}
    gather              only gather cluster information, i.e. generate a state file
    repairstats         which OSDs repaired their stored data?
    test                test internal stuff
    osdmap              compatibility with ceph osd maps

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase program verbosity
  -q, --quiet           decrease program verbosity
  --osdsize {device,weighted,crush}
                        what parameter to take for determining the osd size.
                        default: crush. device=device_size, weighted=devsize*weight,
                        crush=crushweight*weight

This is adoption of JJ's Ceph Balancer https://github.com/TheJJ/ceph-balancer
This balancer doesn't change your cluster anyway, it just prints the commands you can run to generate movements.
Example: for max 10 pg movements:
`jj-ceph-balancer -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps`
If you're satisfied, run: $ bash /tmp/balance-upmaps

To get pool and OSD usage overview:
jj-ceph-balancer show --osds --per-pool-count --sort-utilization

Checkout more with `jj-ceph-balancer --help`
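Because toolkit run forwards additional arguments to the underlying tool (as with -h above), the dry-run balance example from the tool's help can also be launched through Copilot. A minimal sketch, assuming arguments are passed through unchanged:

$ ceph-copilot toolkit run contrib/jj_ceph_balancer -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
$ bash /tmp/balance-upmaps   # only after reviewing the suggested upmap commands

The balancer itself does not change the cluster; it only prints the commands that would generate the data movements.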


Pools Command

Example of the pools pg distribution command:

$ ceph-copilot pools pg distribution
# NumSamples = 270; Min = 86.00; Max = 160.00
# Mean = 120.044444; Variance = 119.583210; SD = 10.935411; Median 124.000000
# each # represents a count of 2
86.0000 - 93.4000 [ 18]: #########
93.4000 - 100.8000 [ 0]:
100.8000 - 108.2000 [ 29]: ##############
108.2000 - 115.6000 [ 7]: ###
115.6000 - 123.0000 [ 48]: ########################
123.0000 - 130.4000 [ 166]: ###################################################################################
130.4000 - 137.8000 [ 0]:
137.8000 - 145.2000 [ 0]:
145.2000 - 152.6000 [ 0]:
152.6000 - 160.0000 [ 2]: #

This example comes from a cluster with 270 OSDs: each sample in the histogram is one OSD, and the value is the number of PGs mapped to it. The output therefore shows how evenly PGs are distributed across the OSDs; here most OSDs carry between roughly 116 and 130 PGs (mean ~120), with outliers as low as 86 and as high as 160.
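If the distribution looks uneven, the contrib/jj_ceph_balancer tool from the toolkit (see above) can give a more detailed per-pool and per-OSD view and suggest upmap commands to even things out. A sketch of the overview command from the balancer's own help, assuming toolkit run forwards the arguments unchanged:

$ ceph-copilot toolkit run contrib/jj_ceph_balancer show --osds --per-pool-count --sort-utilization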