
47 posts tagged with "operation"


· One min read
Joachim Kraftmayer

Ceph cluster health state

```bash
root@clyso-ceph:/home/ceph# ceph -s
  cluster:
    id:     FEBB01CC-4AA5-4C1D-80C4-9D91901467C8
    health: HEALTH_WARN
            256 pgs not deep-scrubbed in time
            266 pgs not scrubbed in time
```

The admin has noticed that scrubbing and deep-scrubbing have fallen behind.
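To track whether the backlog shrinks after any tuning, the scrub warnings can be pulled out of `ceph health detail`. A minimal sketch, assuming the summary-line format `[WRN] PG_NOT_DEEP_SCRUBBED: 256 pgs not deep-scrubbed in time` (PG_NOT_SCRUBBED and PG_NOT_DEEP_SCRUBBED are Ceph health check codes; the function name `scrub_backlog` is hypothetical):

```bash
#!/usr/bin/env bash
# Sketch: print "CHECK COUNT" for the scrub-related health checks.
# Assumes summary lines like "[WRN] PG_NOT_DEEP_SCRUBBED: 256 pgs not deep-scrubbed in time".
scrub_backlog() {
  # Keep only the PG_NOT_SCRUBBED / PG_NOT_DEEP_SCRUBBED summary lines,
  # then strip the severity prefix and the trailing text.
  ceph health detail \
    | grep -E 'PG_NOT_(DEEP_)?SCRUBBED:' \
    | sed -E 's/^\[[[:upper:]]+] ([[:upper:]_]+): ([0-9]+) .*/\1 \2/'
}
```

Running `scrub_backlog` periodically shows whether the counts go down after osd_max_scrubs has been changed.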

Possible actions

If an impact on Ceph performance is acceptable, osd_max_scrubs and osd_scrub_load_threshold can be adjusted carefully. Be aware that raising them can have a significant impact on Ceph cluster performance.

Show current config

```bash
root@clyso-ceph:/home/ceph# ceph config show osd.1 osd_max_scrubs
1
root@clyso-ceph:/home/ceph# ceph config get osd osd_scrub_load_threshold
0.500000
```

Set osd_max_scrubs

```bash
ceph config set osd osd_max_scrubs 2
```

Verify setting

Ceph config database

```bash
root@ceph-vm-az1-1:/home/kraftmayerj# ceph config get osd osd_max_scrubs
2
```

Ceph osd active settings (osd.1)

```bash
root@clyso-ceph:/home/ceph# ceph config show osd.1 osd_max_scrubs
2
```
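`ceph config show` only checks one daemon at a time. Below is a minimal sketch with the hypothetical helpers `check_osd_option` and `reset_osd_option`. It compares every OSD's running value against the config database and can remove the override again once the scrub backlog has been cleared. It assumes the OSDs listed by `ceph osd ls` are running, because `ceph config show` reports the values of running daemons.

```bash
#!/usr/bin/env bash
# Sketch: report OSDs whose running value of an option differs from the
# value stored in the config database (default option: osd_max_scrubs).
check_osd_option() {
  local option="${1:-osd_max_scrubs}"
  local want have id
  want="$(ceph config get osd "$option")" || return 1
  for id in $(ceph osd ls); do
    # If `ceph config show` fails (e.g. the daemon is not running),
    # report the value as unavailable instead of aborting the loop.
    if ! have="$(ceph config show "osd.$id" "$option" 2>/dev/null)"; then
      have="unavailable"
    fi
    if [ "$have" != "$want" ]; then
      echo "osd.$id: $option=$have (config database: $want)"
    fi
  done
}

# Remove the override from the config database so the OSDs fall back
# to the default value.
reset_osd_option() {
  ceph config rm osd "${1:-osd_max_scrubs}"
}
```

`check_osd_option osd_max_scrubs` prints nothing when every OSD uses the value from the config database. `reset_osd_option osd_max_scrubs` removes the override again.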

Sources

osd_max_scrubs

osd_scrub_load_threshold

osd_scrub_during_recovery

· One min read
Joachim Kraftmayer

A customer asked us to scale the performance and capacity of their existing Ceph cluster to keep pace with the growth of their production workload.

We grew the Ceph cluster by 3700% during operation, without maintenance windows and without service interruption.

· One min read
Joachim Kraftmayer

On behalf of a customer, we provided a Kubernetes platform optimized for ONAP as a managed service.

ONAP is a comprehensive platform for the orchestration, management, and automation of network and edge computing services for network operators, cloud providers, and enterprises. Real-time, policy-driven orchestration and automation of physical and virtual network functions enable rapid automation of new services and complete lifecycle management, both critical for 5G and next-generation networks.

· One min read
Joachim Kraftmayer

After a reboot of the MDS server, the CephFS file system can become read-only:

```
HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.XXX(mds.0): MDS in read-only mode
```

In the MDS log you will find the following entries:

```
log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
mds.0.cache force file system read-only
log_channel(cluster) log [WRN] : force file system read-only
mds.0.server force_clients_readonly
```

https://tracker.ceph.com/issues/58082

This is a known upstream issue, though the fix has not been merged yet.

As a workaround, you can use the following steps:

```bash
ceph config set mds mds_dir_max_commit_size 80
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true
```

If this does not help, you may need to increase mds_dir_max_commit_size further, e.g. to 160.
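The workaround steps can be wrapped in a small function, together with a check for whether the MDS_READ_ONLY warning is still raised. This is a sketch: the function names `mds_readonly_workaround` and `mds_still_readonly` are hypothetical, and you pass in the file system name yourself.

```bash
#!/usr/bin/env bash
# Sketch of the workaround: raise mds_dir_max_commit_size, fail the file
# system, then make it joinable again. The commit size defaults to 80.
mds_readonly_workaround() {
  local fs_name="$1" commit_size="${2:-80}"
  if [ -z "$fs_name" ]; then
    echo "usage: mds_readonly_workaround <fs_name> [commit_size]" >&2
    return 2
  fi
  ceph config set mds mds_dir_max_commit_size "$commit_size" || return 1
  ceph fs fail "$fs_name" || return 1
  ceph fs set "$fs_name" joinable true || return 1
}

# Returns 0 while the MDS_READ_ONLY health check is still reported.
mds_still_readonly() {
  ceph health detail | grep -q 'MDS_READ_ONLY'
}
```

For example, run `mds_readonly_workaround <fs_name>` and wait until an MDS is active again (`ceph fs status`). If `mds_still_readonly` still succeeds afterwards, retry with `mds_readonly_workaround <fs_name> 160`.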

· One min read
Joachim Kraftmayer

If you had to recreate the device_health_metrics or .mgr pool, the devicehealth module is missing its SQLite3 database structure. You have to recreate the structure manually.

Crash events:

```
"backtrace": [
    " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n self.scrape_all()",
    " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n self.put_device_metrics(device, data)",
    " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n self._create_device(devid)",
    " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n cursor = self.db.execute(SQL, (devid,))",
    "sqlite3.InternalError: unknown operation"
]
```
Install the Ceph SQLite VFS module (libcephsqlite):

```bash
apt install libsqlite3-mod-ceph libsqlite3-mod-ceph-dev
```

Create the database

```bash
clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///.mgr:devicehealth/main.db?vfs=ceph'
main: "" r/w
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite>
```

List databases

```bash
clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.databases'
main: "" r/w
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite>
```

Create the table

```bash
clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///.mgr:devicehealth/main.db?vfs=ceph'
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite> CREATE TABLE IF NOT EXISTS MgrModuleKV (
   key TEXT PRIMARY KEY,
   value NOT NULL
) WITHOUT ROWID;
sqlite> INSERT OR IGNORE INTO MgrModuleKV (key, value) VALUES ('__version', 0);
sqlite> .tables
Device               DeviceHealthMetrics  MgrModuleKV
sqlite>
```
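The same statements can also be piped into `sqlite3` non-interactively, which is handy in a runbook. This is a sketch. It assumes `libcephsqlite.so` is installed (package libsqlite3-mod-ceph) and that the client running it has a keyring with access to the `.mgr` pool. `create_mgr_kv_table` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: create the MgrModuleKV table of the devicehealth database
# without an interactive sqlite3 session.
DEVICEHEALTH_DB_URI='file:///.mgr:devicehealth/main.db?vfs=ceph'

create_mgr_kv_table() {
  # Load the Ceph VFS, open the database in RADOS, then read the SQL
  # from the here-document on stdin; sqlite3 exits at end of input.
  sqlite3 -cmd '.load libcephsqlite.so' \
          -cmd ".open $DEVICEHEALTH_DB_URI" <<'SQL'
CREATE TABLE IF NOT EXISTS MgrModuleKV (
  key TEXT PRIMARY KEY,
  value NOT NULL
) WITHOUT ROWID;
INSERT OR IGNORE INTO MgrModuleKV (key, value) VALUES ('__version', 0);
.tables
SQL
}
```

Because of `IF NOT EXISTS` and `INSERT OR IGNORE`, running `create_mgr_kv_table` a second time should leave an existing table unchanged.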

Sources

https://ceph.io/en/news/blog/2021/new-in-pacific-sql-on-ceph

https://docs.ceph.com/en/latest/rados/api/libcephsqlite/

https://docs.ceph.com/en/latest/rados/api/libcephsqlite/#usage

https://github.com/ceph/ceph/blob/main/src/pybind/mgr

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/devicehealth/module.py

· One min read
Joachim Kraftmayer

If you don't want to set flags such as noout or noup for the whole cluster, you can use ceph osd set-group and ceph osd unset-group to set the appropriate flag for a group of OSDs or even for whole hosts.

```bash
ceph osd set-group <flags> <who>
ceph osd unset-group <flags> <who>
```
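Because it is easy to forget the unset-group step, the pair can be wrapped around the maintenance command itself. This is a minimal sketch: `with_group_flag` is a hypothetical helper. The command you pass must only return once the host's OSDs are back up, otherwise the flag is removed too early.

```bash
#!/usr/bin/env bash
# Sketch: set a flag on one CRUSH node (e.g. a host) for the duration of
# a command, and unset it again even if the command fails.
with_group_flag() {
  local flags="$1" who="$2" rc=0
  shift 2
  ceph osd set-group "$flags" "$who" || return 1
  # Run the maintenance command; remember its exit status.
  "$@" || rc=$?
  ceph osd unset-group "$flags" "$who"
  return "$rc"
}
```

For example: `with_group_flag noout clyso-ceph-node3 do_maintenance`, where `do_maintenance` stands for your own (hypothetical) maintenance command.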

For example, to set noout for all OSDs of a host:

```bash
root@clyso-ceph-node1:~# ceph osd set-group noout clyso-ceph-node3
root@clyso-ceph-node1:~# ceph health detail
HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
[WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
    host clyso-ceph-node3 has flags noout
root@clyso-ceph-node1:~# ceph osd unset-group noout clyso-ceph-node3
root@clyso-ceph-node1:~# ceph health detail
HEALTH_OK
```

Sources:

https://docs.ceph.com/en/quincy/rados/operations/health-checks/#osd-flags