The Ceph Community hosted its first post-pandemic event at the Bloomberg offices in New York City. Ceph Day NYC was a great success!
Ceph Reef vs Quincy RBD Performance
Clyso's Mark Nelson has written the first part in a series looking at performance testing of the upcoming Ceph Reef release vs the previous Quincy release. See the blog post here!
Please feel free to contact us if you are interested in Ceph support or performance consulting!
ceph - how to disable the mclock scheduler
After more than 4 years of development, the mClock scheduler is the default in Ceph Quincy (version 17). If you don't want to use this scheduler, you can disable it with the option osd_op_queue.
WPQ was the default before Ceph Quincy and the change requires a restart of the OSDs.
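A minimal sketch of switching back, assuming a package-based deployment where the OSDs are restarted via systemd (cephadm users would restart the OSD daemons through the orchestrator instead):
# set the scheduler back to WPQ for all OSDs
ceph config set osd osd_op_queue wpq
# verify the setting
ceph config get osd osd_op_queue
# restart the OSDs on each OSD host so the change takes effect
systemctl restart ceph-osd.target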
Sources:
https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/#confval-osd_op_queue
https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/#qos-based-on-mclock
Fix CephFS Filesystem Read-Only
After a reboot of the MDS server it can happen that the CephFS file system becomes read-only:
HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
mds.XXX(mds.0): MDS in read-only mode
https://tracker.ceph.com/issues/58082
In the MDS log you will find the following entries:
log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
mds.0.cache force file system read-only
log_channel(cluster) log [WRN] : force file system read-only
mds.0.server force_clients_readonly
This is a known upstream issue, though the fix is not merged yet.
As a workaround you can use the following steps:
ceph config set mds mds_dir_max_commit_size 80
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true
If this is not successful, you may need to increase mds_dir_max_commit_size further, e.g. to 160, and repeat the steps, as sketched below.
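A sketch of the second attempt with the larger commit size; <fs_name> is a placeholder for your file system name:
ceph config set mds mds_dir_max_commit_size 160
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true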
Ceph Quincy release with bugfix for PGLog dups
Our bugfix from earlier this year was published in the Ceph Quincy release 17.2.4.
Trimming of PGLog dups is now controlled by size instead of the version. This fixes the PGLog inflation issue that was happening when online (in OSD) trimming jammed after a PG split operation. Also, a new offline mechanism has been added: ceph-objectstore-tool now has a trim-pg-log-dups op that targets situations where an OSD is unable to boot due to those inflated dups. If that is the case, in OSD logs the “You can be hit by THE DUPS BUG” warning will be visible. Relevant tracker: https://tracker.ceph.com/issues/53729
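A rough sketch of the offline trim for a single PG, assuming a package-based install with OSD id 17 and PG 2.7 as placeholders; depending on how inflated the dups are, additional trim-related options may be needed (see the related posts below):
# stop the affected OSD so ceph-objectstore-tool can open its object store
systemctl stop ceph-osd@17
# trim the inflated PG log dups offline for one PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --op trim-pg-log-dups --pgid 2.7
# start the OSD again
systemctl start ceph-osd@17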
related posts
osds with unlimited ram growth
how to identify osds affected by pg dup bug
Sources
https://docs.ceph.com/en/latest/releases/quincy/#v17-2-4-quincy
[WRN] clients failing to respond to cache pressure
When the MDS cache fills up, the MDS must evict inodes from its cache. This also means that the MDS will ask some clients to drop inodes from their caches as well.
The MDS asks the CephFS client several times to release these inodes. If the client does not respond to the cache recall requests, Ceph will log this warning.
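As a starting point for investigating the warning, the following sketch shows how to identify the clients and inspect the MDS cache limit; raising mds_cache_memory_limit (default 4 GiB) is only one possible mitigation, and fixing a misbehaving client may be the better one:
# show which clients the warning refers to
ceph health detail
# inspect the current MDS cache memory limit
ceph config get mds mds_cache_memory_limit
# one possible mitigation: raise the limit, e.g. to 8 GiB
ceph config set mds mds_cache_memory_limit 8589934592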
IBM will add Red Hat storage product roadmaps and Red Hat associate teams to the IBM Storage business unit
ARMONK, N.Y., Oct. 4, 2022 /PRNewswire/ -- IBM (NYSE: IBM) announced today it will add Red Hat storage product roadmaps and Red Hat associate teams to the IBM Storage business unit, bringing consistent application and data storage across on-premises infrastructure and cloud.
rook ceph validate that the RBD cache is active inside the k8s pod
validate if the RBD Cache is active on your client
The cache has been enabled by default since version 0.87.
To explicitly enable the cache on the client side, add the following configuration to /etc/ceph/ceph.conf:
[client]
rbd cache = true
rbd cache writethrough until flush = true
add local admin socket
So that you can also verify the status on the client side, add the following two parameters:
[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/
configure permissions and security
Both paths must be writable by the user running the application that uses the RBD library, and security frameworks such as SELinux or AppArmor must be configured accordingly.
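A small sketch of what that can look like for a KVM guest; the qemu user and group are assumptions and depend on your distribution and security setup:
# create the socket and log directories and make them writable for the client process
sudo mkdir -p /var/run/ceph /var/log/ceph
sudo chown qemu:qemu /var/run/ceph /var/log/ceph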
request info via the admin socket
Once this is done, run your application that is supposed to use librbd (kvm, docker, podman, ...) and request the information via the admin daemon socket:
$ sudo ceph --admin-daemon /var/run/ceph/ceph-client.admin.66606.140190886662256.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
Verify the cache behaviour
To measure the performance difference, you can deactivate the cache in the [client] section of your ceph.conf as follows:
[client]
rbd cache = false
Then run a fio benchmark with the following command:
fio --name=rbd-cache-test --ioengine=rbd --pool=<pool-name> --rbdname=rbd1 --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
Finally, run this test with RBD client cache enabled and disabled and you should notice a significant difference.
Sources
https://www.sebastien-han.fr/blog/2015/09/02/ceph-validate-that-the-rbd-cache-is-active/
ceph-mgr recreate sqlite database for devicehealth module
If you had to recreate the device_health_metrics or .mgr pool, the devicehealth module is missing its SQLite3 database structure. You have to recreate the structure manually.
crash events
backtrace": [
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n self.scrape_all()",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n self.put_device_metrics(device, data)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n self._create_device(devid)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n cursor = self.db.execute(SQL, (devid,))",
"sqlite3.InternalError: unknown operation"
install required packages
apt install libsqlite3-mod-ceph libsqlite3-mod-ceph-dev
create database
clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///.mgr:devicehealth/main.db?vfs=ceph'
main: "" r/w
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite>
list databases
clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.databases'
main: "" r/w
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite>
create table
clyso@compute-21:~$ sqlite3 -cmd '.load libcephsqlite.so' -cmd '.open file:///.mgr:devicehealth/main.db?vfs=ceph'
SQLite version 3.39.1 2022-07-13 19:41:41
Enter ".help" for usage hints.
sqlite> CREATE TABLE IF NOT EXISTS MgrModuleKV (
key TEXT PRIMARY KEY,
value NOT NULL
) WITHOUT ROWID;
sqlite> INSERT OR IGNORE INTO MgrModuleKV (key, value) VALUES ('__version', 0);
sqlite> .tables
Device DeviceHealthMetrics MgrModuleKV
sqlite>
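Once the structure exists, one way to verify that the devicehealth module can write again (assuming a running cluster with at least one monitored device) is to restart the active mgr and trigger a scrape:
# restart the active ceph-mgr so the devicehealth module reopens the database
ceph mgr fail
# trigger a health metrics scrape and check that no new crash events appear
ceph device scrape-health-metrics
ceph crash ls-new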
sources
https://ceph.io/en/news/blog/2021/new-in-pacific-sql-on-ceph
https://docs.ceph.com/en/latest/rados/api/libcephsqlite/
https://docs.ceph.com/en/latest/rados/api/libcephsqlite/#usage
https://github.com/ceph/ceph/blob/main/src/pybind/mgr
https://github.com/ceph/ceph/blob/main/src/pybind/mgr/devicehealth/module.py
Ceph S3 load and performance test
motivation
We have intensively tested Ceph S3 with OpenStack Swift before, and we were interested in the behavior of the radosgw stack in Ceph. We paid particular attention to the size and number of objects in relation to the resource consumption of the radosgw process. The effects on the response latencies of radosgw were also important to us, in order to be able to plan the right sizing of the physical and virtual environments.
technical topics
From a technical point of view, we were interested in the behavior of radosgw in the following areas:
- dynamic bucket sharding
- http frontend difference between Civetweb and Beast
- index pool io pattern and latencies
- data pool io pattern and latencies with erasure-coded and replicated pools
- fast_read vs. standard read for workloads with large and small objects (see the example below)
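For illustration, the last point refers to the per-pool fast_read flag, which only applies to erasure-coded pools; a minimal sketch with a placeholder pool name:
# check whether fast_read is enabled on the erasure-coded data pool
ceph osd pool get <data-pool> fast_read
# enable it: reads return as soon as enough shards have arrived to reconstruct the object
ceph osd pool set <data-pool> fast_read 1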
requirements
When choosing the right tool, it was important for us to be able to test both small and large Ceph clusters with several thousand OSDs.
We wanted to be able to evaluate the test results from a file as well as have a graphical representation as time-series data.
For time-series data we rely on the standard stack with Grafana, Prometheus and Thanos.
The main Prometheus exporters we use are ceph-mgr-exporter and node-exporter.
load and performance tools
CBT - The Ceph Benchmarking Tool
CBT is a testing harness written in Python.
s3-tests
This is a set of unofficial Amazon AWS S3 compatibility tests.
https://github.com/ceph/s3-tests
COSBench - Cloud Object Storage Benchmark
COSBench is a benchmarking tool to measure the performance of Cloud Object Storage services.
https://github.com/intel-cloud/cosbench
Gosbench
Gosbench is the Golang reimplementation of COSBench. It is a distributed S3 performance benchmark tool with a Prometheus exporter, leveraging the official Golang AWS SDK.
https://github.com/mulbc/gosbench
hsbench
hsbench is an S3 compatible benchmark originally based on wasabi-tech/s3-benchmark.
https://github.com/markhpc/hsbench
Warp
MinIO's S3 benchmarking tool.
the tool of our choice
getput
getput can be run individually on a test client.
gpsuite is responsible for synchronization and scaling across any number of test clients. Communication takes place via ssh keys and the simultaneous start of all s3 test clients is synchronized over a common time base.
Installation on linux as script or as container is supported.