
· One min read
Joachim Kraftmayer

After more than 4 years of development, mClock is the default scheduler for Ceph Quincy (version 17). If you don't want to use the mClock scheduler, you can disable it with the osd_op_queue option.

WPQ was the default before Ceph Quincy, and changing the option requires a restart of the OSDs.
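
To check which scheduler the OSDs are currently using and to switch back to WPQ, a minimal sketch could look like this (the restart step assumes a cephadm-managed cluster):

```bash
# show the scheduler currently configured for the OSDs
ceph config get osd osd_op_queue

# switch back to the previous default scheduler
ceph config set osd osd_op_queue wpq

# the change only takes effect after an OSD restart, e.g. per daemon with cephadm:
ceph orch daemon restart osd.0
```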

Source:

https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/#confval-osd_op_queue

https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/#qos-based-on-mclock

· One min read
Joachim Kraftmayer

After a reboot of the MDS server it can happen that the CephFS filesystem becomes read-only:

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
mds.XXX(mds.0): MDS in read-only mode

In the MDS log you will find the following entry:

log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
mds.0.cache force file system read-only
log_channel(cluster) log [WRN] : force file system read-only
mds.0.server force_clients_readonly

https://tracker.ceph.com/issues/58082

This is a known upstream issue, though the fix has not yet been merged.

As a workaround, you can use the following steps:

ceph config set mds mds_dir_max_commit_size 80
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true

If this is not successful, you may need to increase mds_dir_max_commit_size further, e.g. to 160.
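
A sketch of the second attempt with the larger value, repeating the steps from above:

```bash
# raise the commit size further and make the filesystem joinable again
ceph config set mds mds_dir_max_commit_size 160
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true
```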

· One min read
Joachim Kraftmayer

Our bugfix from earlier this year was published in the Ceph Quincy release 17.2.4.

Trimming of PGLog dups is now controlled by size instead of the version. This fixes the PGLog inflation issue that was happening when online (in OSD) trimming jammed after a PG split operation. Also, a new offline mechanism has been added: ceph-objectstore-tool now has a trim-pg-log-dups op that targets situations where an OSD is unable to boot due to those inflated dups. If that is the case, in OSD logs the “You can be hit by THE DUPS BUG” warning will be visible. Relevant tracker: https://tracker.ceph.com/issues/53729
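
A rough sketch of the offline trim on a stopped OSD (the data path, OSD id and PG id are placeholders; check the ceph-objectstore-tool documentation for the exact options):

```bash
# the OSD must be stopped before touching its object store
systemctl stop ceph-osd@11

# trim the inflated pg_log dups for one PG offline
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-11 \
  --pgid 2.7f \
  --op trim-pg-log-dups
```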

osds with unlimited ram growth

how to identify osds affected by pg dup bug

Sources

https://docs.ceph.com/en/latest/releases/quincy/#v17-2-4-quincy

· One min read
Joachim Kraftmayer

When the MDS cache runs full, the MDS process must clear inodes from its cache. This also means that the MDS will prompt some clients to clear some inodes from their caches as well.

The MDS asks the CephFS client several times to release the inodes. If the client does not respond to this cache recall request, Ceph will log this warning.
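
If you want to see which clients are holding a large number of caps and are therefore the likely targets of such recall requests, listing the MDS sessions is one way to check (a minimal sketch; rank 0 is an assumption here):

```bash
# list CephFS client sessions on rank 0 and show their cap counts
ceph tell mds.0 session ls | grep -E '"id"|"num_caps"'
```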

· One min read
Joachim Kraftmayer

ARMONK, N.Y., Oct. 4, 2022 /PRNewswire/ -- IBM (NYSE: IBM) announced today it will add Red Hat storage product roadmaps and Red Hat associate teams to the IBM Storage business unit, bringing consistent application and data storage across on-premises infrastructure and cloud.

Sources:

2022-10-04-IBM-Redefines-Hybrid-Cloud-Application-and-Data-Storage-Adding-Red-Hat-Storage-to-IBM-Offerings

· 2 min read
Joachim Kraftmayer

validate if the RBD Cache is active on your client

The cache has been enabled by default since version 0.87.

To enable the cache on the client side, you have to add the following config to /etc/ceph/ceph.conf:

[client]
rbd cache = true
rbd cache writethrough until flush = true

add local admin socket

To be able to verify the status on the client side as well, you must add the following two parameters:

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/

configure permissions and security

Both paths must be writable by the user who uses the RBD library, and security frameworks such as SELinux or AppArmor must be configured accordingly.
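
For example (the qemu user is just an assumption for a KVM host; use whichever user actually runs your librbd application):

```bash
# create the socket and log directories and make them writable
# for the user running the librbd application (e.g. qemu for KVM)
sudo mkdir -p /var/run/ceph /var/log/ceph
sudo chown qemu:qemu /var/run/ceph /var/log/ceph
```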

request info via the admin socket

Once this is done, run your application that is supposed to use librbd (kvm, docker, podman, ...) and request the information via the admin daemon socket:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-client.admin.66606.140190886662256.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",

Verify the cache behaviour

To compare the performance difference, you can deactivate the cache in the [client] section of your ceph.conf as follows:

[client]
rbd cache = false

Then run a fio benchmark with the following command:

fio --ioengine=rbd --pool=<pool-name> --rbdname=rbd1 --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

Finally, run this test once with the RBD client cache enabled and once with it disabled, and you should notice a significant difference.

Sources

https://www.sebastien-han.fr/blog/2015/09/02/ceph-validate-that-the-rbd-cache-is-active/

· One min read
Joachim Kraftmayer

If you had to recreate the device_health or .mgr pool, the devicehealth module is missing its sqlite3 database structure. You have to recreate the structure manually.

crash events

backtrace": &#91;
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n self.scrape_all()",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n self.put_device_metrics(device, data)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n self._create_device(devid)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n cursor = self.db.execute(SQL, (devid,))",
"sqlite3.InternalError: unknown operation"
The libcephsqlite VFS must be installed on the host where you run sqlite3:

apt install libsqlite3-mod-ceph libsqlite3-mod-ceph-dev

create database

clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///.mgr:devicehealth/main.db?vfs=ceph'
main: "" r/w
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite>

list databases

clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.databases'
main: "" r/w
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite>

create table

clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///.mgr:devicehealth/main.db?vfs=ceph'
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite> CREATE TABLE IF NOT EXISTS MgrModuleKV (
key TEXT PRIMARY KEY,
value NOT NULL
) WITHOUT ROWID;
sqlite> INSERT OR IGNORE INTO MgrModuleKV (key, value) VALUES ('__version', 0);
sqlite> .tables
Device DeviceHealthMetrics MgrModuleKV
sqlite>
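
Once the structure exists, a quick sanity check is to restart the active mgr and verify that the devicehealth module works again (a sketch):

```bash
# fail over to a standby mgr so the devicehealth module reopens the database
ceph mgr fail

# the module should now scrape and list devices again without new crash events
ceph device ls
ceph crash ls-new
```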

sources

https://ceph.io/en/news/blog/2021/new-in-pacific-sql-on-ceph

https://docs.ceph.com/en/latest/rados/api/libcephsqlite/

https://docs.ceph.com/en/latest/rados/api/libcephsqlite/#usage

https://github.com/ceph/ceph/blob/main/src/pybind/mgr

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/devicehealth/module.py

· 2 min read
Joachim Kraftmayer

motivation

We had already tested Ceph S3 and OpenStack Swift intensively before. We were interested in the behavior of the radosgw stack in Ceph. We paid particular attention to the size and number of objects in relation to the resource consumption of the radosgw process. The effects on the response latencies of radosgw were also important to us, so that we could plan the right sizing of the physical and virtual environments.

technical topics

From a technical point of view, we were interested in the behavior of radosgw in the following areas:

  • dynamic bucket sharding
  • http frontend difference between Civetweb and Beast
  • index pool io pattern and latencies
  • data pool io pattern and latencies with erasure-coded and replicated pools
  • fast_read vs. standard read for workloads with large and small objects.

requirements

When choosing the right tool, it was important for us to be able to test both small and large Ceph clusters with several thousand OSDs.

We want to be able to use the test results as files for evaluation as well as have a graphical representation as time-series data.

For timeseries data we rely on the standard stack with Grafana, Prometheus and Thanos.

The main Prometheus exporters we use are ceph-mgr-exporter and node-exporter.

load and performance tools

CBT - The Ceph Benchmarking Tool

CBT is a testing harness written in Python.

https://github.com/ceph/cbt

s3-tests

This is a set of unofficial Amazon AWS S3 compatibility tests.

https://github.com/ceph/s3-tests

COSBench - Cloud Object Storage Benchmark

COSBench is a benchmarking tool to measure the performance of Cloud Object Storage services.

https://github.com/intel-cloud/cosbench

Gosbench

Gosbench is the Golang reimplementation of COSBench. It is a distributed S3 performance benchmark tool with a Prometheus exporter, leveraging the official Golang AWS SDK.

https://github.com/mulbc/gosbench

hsbench

hsbench is an S3 compatible benchmark originally based on wasabi-tech/s3-benchmark.

https://github.com/markhpc/hsbench

Warp

Warp is MinIO's S3 benchmarking tool.

https://github.com/minio/warp

the tool of our choice

getput

getput can be run individually on a test client.

gpsuite is responsible for synchronization and scaling across any number of test clients. Communication takes place via SSH keys, and the simultaneous start of all S3 test clients is synchronized over a common time base.

Installation on Linux as a script or as a container is supported.

https://github.com/markseger/getput

· One min read
Joachim Kraftmayer

If you don't want to set flags like noout or noup for the whole cluster, you can use ceph osd set-group and ceph osd unset-group to set the appropriate flag for a group of OSDs or even whole hosts.

ceph osd set-group <flags> <who>
ceph osd unset-group <flags> <who>
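
Several flags and several targets can be combined in one call, e.g. for individual OSDs (a sketch based on the syntax above):

```bash
# set two flags for two specific OSDs at once
ceph osd set-group noup,noout osd.0 osd.1

# remove them again
ceph osd unset-group noup,noout osd.0 osd.1
```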

For example, set noout for a whole host with OSDs:

```bash
ceph osd set-group noout clyso-ceph-node3
```

```bash
root@clyso-ceph-node1:~# ceph health detail
HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
[WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
    host clyso-ceph-node3 has flags noout
```

After unsetting the flag, the cluster is healthy again:

```bash
root@clyso-ceph-node1:~# ceph osd unset-group noout clyso-ceph-node3
root@clyso-ceph-node1:~# ceph health detail
HEALTH_OK
root@clyso-ceph-node1:~#
```

Sources:

https://docs.ceph.com/en/quincy/rados/operations/health-checks/#osd-flags