
· One min read
Joachim Kraftmayer

get config (default: 4G)

ceph daemon mds.<mds-id> config get mds_cache_memory_limit
ceph daemon /var/run/ceph/<fsid>/<mds-id> config get mds_cache_memory_limit
ceph tell mds.storefs-a config show |grep mds_cache_memory_limit

set config on the fly, not persistent (to 64 GB)

ceph daemon mds.<mds-id> config set mds_cache_memory_limit 68719476736
ceph daemon /var/run/ceph/<fsid>/<mds-id> config set mds_cache_memory_limit 68719476736
ceph tell mds.storefs-a injectargs --mds_cache_memory_limit 68719476736

persist config (to 64 GB)

ceph config set mds mds_cache_memory_limit 68719476736
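
To check that the persisted value is actually picked up, you could combine the commands shown above (a minimal sketch; <mds-id> is the MDS daemon name as before):

ceph config get mds mds_cache_memory_limit
ceph tell mds.<mds-id> config get mds_cache_memory_limit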

· One min read
Joachim Kraftmayer

We have two options to get the gateway.conf:

gwcli

gwcli export mode=copy

or

rados

rados -p iscsi get gateway.conf /root/gateway.conf

At the moment there is no way to update or write the gateway.conf via the gwcli command, so the only option is to use the rados command line tool.
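
Writing a modified configuration back would then be a plain rados put, for example (a sketch assuming the pool and object name shown above; keep a copy of the original object before overwriting it):

rados -p iscsi put gateway.conf /root/gateway.conf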

note

Editing the content manually requires care and expertise.

sources

docs.ceph.com/en/latest/man/8/rados

manpages.ubuntu.com/manpages/jammy/man8/gwcli.8.html

· 2 min read
Joachim Kraftmayer

You might have wondered how to get rid of the warning "pools have many more objects per pg than average" because you want to see your cluster in HEALTH_OK status. The option that controls the threshold for this warning is mon_pg_warn_max_object_skew.

Especially when a Ceph cluster or a new pool first goes into production, you can set the threshold high. After some time you should check the value again and adjust it if necessary.

An important note: the option must be set on the ceph mgr. You will often find posts that set it on the ceph mon and then see no effect on the cluster status.

The cluster status commands 

ceph status

or

ceph health detail

show the following warning:

[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
    pool test objects per pg (2079) is more than 11.6798 times cluster average (178)

note

To disable the warning completely, the value of mon_pg_warn_max_object_skew must be set to 0 or a negative number.
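
For example, disabling it persistently would look like this (using the same ceph config set pattern shown further below):

ceph config set mgr mon_pg_warn_max_object_skew 0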

Verify the default value

ceph config get mgr mon_pg_warn_max_object_skew
10.000000

inject the value:

ceph tell mgr.a injectargs '--mon_pg_warn_max_object_skew 50'

verify the value:

ceph tell mgr.a config get mon_pg_warn_max_object_skew
{
"mon_pg_warn_max_object_skew": "50.000000"
}

set the value persistently, e.g. to a skew factor of 50:

ceph config set mgr mon_pg_warn_max_object_skew 50
ceph config get mgr mon_pg_warn_max_object_skew
50.000000

· One min read
Joachim Kraftmayer

We were speakers at the first edition of Cloudland

Cloudland is the festival of the German-speaking Cloud Native Community (DCNC), with the aim of communicating the current status quo in the use of cloud technologies and focusing in particular on future challenges.

Our contribution on Multi Cloud Deployment met with great interest at the "Container & Cloud Technologies" theme day.

· 2 min read
Joachim Kraftmayer

ceph-volume can be used to create a new WAL/DB on a faster device for an existing OSD, without the need to recreate the OSD.

ceph-volume lvm new-db --osd-id 15 --osd-fsid FSID --target cephdb/cephdb1
--> NameError: name 'get_first_lv' is not defined

This is a bug in ceph-volume v16.2.7 that will be fixed in v16.2.8:
https://github.com/ceph/ceph/pull/44209

First, create a new logical volume on the device that will hold the new WAL/DB

vgcreate cephdb /dev/sdb
Volume group "cephdb" successfully created
lvcreate -L 100G -n cephdb1 cephdb
Logical volume "cephdb1" created.

Now stop the running OSD and, if it is deactivated on the host (cephadm deployment), activate it there

systemctl stop ceph-FSID@osd.0.service
ceph-volume lvm activate --all --no-systemd

Create the new WAL/DB on the new device

ceph-volume lvm new-db --osd-id 0 --osd-fsid OSD-FSID --target cephdb/cephdb1
--> Making new volume at /dev/cephdb/cephdb1 for OSD: 0 (/var/lib/ceph/osd/ceph-0)
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-1
--> New volume attached.

Migrate the existing WAL/DB to the new device

ceph-volume lvm migrate --osd-id 0 --osd-fsid OSD-FSID --from data --target cephdb/cephdb1
--> Migrate to existing, Source: ['--devs-source', '/var/lib/ceph/osd/ceph-0/block'] Target: /var/lib/ceph/osd/ceph-0/block.db
--> Migration successful.

Deactivate the OSD and start it again

ceph-volume lvm deactivate 0
Running command: /bin/umount -v /var/lib/ceph/osd/ceph-0
stderr: umount: /var/lib/ceph/osd/ceph-0 unmounted
systemctl start ceph-FSID@osd.0.service
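
To double-check that the OSD now uses the separate DB device, one option is to look at the OSD metadata and the LVM tags maintained by ceph-volume (a sketch; metadata field names can vary between releases):

ceph osd metadata 0 | grep -i bluefs_db
ceph-volume lvm list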

· 2 min read
Joachim Kraftmayer

First we wanted to use ceph-bluestore-tool bluefs-bdev-new-wal. However, it turned out that it is not possible to ensure that the second DB is actually used. For this reason, we decided to migrate the entire BlueFS of the OSD to an SSD/flash device.

(BlueStore architecture diagram, see tracker.ceph.com/attachments/download/4478/bluestore.png)

Verify the current osd bluestore setup

ceph-bluestore-tool show-label --dev <device>

Verify the current size of the osd bluestore DB

ceph-bluestore-tool bluefs-bdev-sizes --path <osd path>

Migrate the BlueFS data to the new device

ceph-bluestore-tool bluefs-bdev-migrate --path <osd path> --dev-target <new device> --devs-source <device1> [--devs-source <device2>]
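
As a concrete illustration (hypothetical names, reusing the cephdb/cephdb1 logical volume from the ceph-volume post above; the OSD must be stopped), migrating the DB of OSD 0 could look like this:

ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-0 --dev-target /dev/cephdb/cephdb1 --devs-source /var/lib/ceph/osd/ceph-0/block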

Verify the size of the osd bluestore DB after the migration

ceph-bluestore-tool bluefs-bdev-sizes --path <osd path>

If the size does not correspond to the new target size, execute the following command:

ceph-bluestore-tool bluefs-bdev-expand --path <osd path>

Instruct BlueFS to check the size of its block devices and, if they have expanded, make use of the additional space. Please note that only the new files created by BlueFS will be allocated on the preferred block device if it has enough free space, and the existing files that have spilled over to the slow device will be gradually removed when RocksDB performs compaction. In other words, if there is any data spilled over to the slow device, it will be moved to the fast device over time.
https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/#commands

Verify the new osd bluestore setup

ceph-bluestore-tool show-label --dev <device>

Update

You might be interested in a migration method on a higher layer with ceph-volume lvm.

docs.clyso.com/blog/ceph-volume-ceph-osd-migrate-db-to-larger-ssd-flash-device/

Appendix

I'm trying to figure out the appropriate process for adding a separate SSD block.db to an existing OSD. From what I gather the two steps are: 1. Use ceph-bluestore-tool bluefs-bdev-new-db to add the new db device. 2. Migrate the data with ceph-bluestore-tool bluefs-bdev-migrate. I followed this and got both executed fine without any error. Yet when the OSD got started up, it keeps on using the integrated block.db instead of the new db. The block.db link to the new db device was deleted. Again, no error, just not using the new db.
www.spinics.net/lists/ceph-users/msg62357.html

Sources

docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool

tracker.ceph.com/attachments/download/4478/bluestore.png

www.suse.com/support/kb/doc/?id=000020276

· One min read
Joachim Kraftmayer

But, as I already mentioned (for a slightly different case), in newer versions there is ceph-volume lvm migrate [1], which I think allows doing the same in a much simpler way. I have not tried it yet and the documentation is not very clear to me, so one needs to experiment with this before writing exact instructions. We might also need to use the new-db [2] and new-wal [3] commands before running migrate, but I am not sure they are needed for this particular case.

[1] https://docs.ceph.com/en/latest/ceph-volume/lvm/migrate/

[2] https://docs.ceph.com/en/latest/ceph-volume/lvm/newdb/

[3] https://docs.ceph.com/en/latest/ceph-volume/lvm/newwal/

· One min read
Joachim Kraftmayer

Wed, June 8, 2:50pm - 3:20pm | Berlin Congress Center - B - B09

Ceph on Windows (Private & Hybrid Cloud)

Ceph RADOS, RBD and CephFS have been ported to Microsoft Windows, a community effort led by SUSE and Cloudbase Solutions. The goal consisted of porting librados and librbd to Windows Server, providing a kernel driver for exposing RBD devices natively as Windows volumes, support for Hyper-V VMs and, last but not least, even CephFS. During this session we will talk about the architectural differences between Windows and Linux from a storage standpoint and how we retained the same CLI so that long-time Ceph users will feel at home regardless of the underlying operating system. Performance is a key aspect of this porting, with Ceph on Windows significantly outperforming the iSCSI gateway, previously the main option for accessing RBD images from Windows nodes. There will be no lack of live demos, including automating the installation of the Windows binaries, setting up and managing a Ceph cluster across Windows and Linux nodes, spinning up Hyper-V VMs from RBD, and CephFS.

openinfra.dev/summit-schedule

· 3 min read
Joachim Kraftmayer

Follow the recommendation

We have seen many different EC profiles over the last 10+ years, but few that follow the official ceph.io recommendations.

We generally recommend min_size be K+2 or more to prevent loss of writes and data.
docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery
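
As a hedged sketch of what a profile and pool following this recommendation could look like (profile name, pool name and PG count are purely illustrative):

ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec-8-3
ceph osd pool set ecpool min_size 10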

Erasure Coding vs RAID in production

Erasure coding and RAID (e.g. RAID5, RAID6, ...) are often compared because both split data into data chunks and coding (parity) chunks.

However, they differ considerably from each other in production use.

global rule vs data set rule

In software or hardware RAID, the number of hard disks, including hot spares, for storing all data is fixed.
In Ceph, however, e.g. with an EC profile of 8 + 3 and failure domain HOST, a total of 11 servers with one hard disk each are involved in storing a data set.
For the next data set, other servers or other hard disks are used.

Key facts for ceph recovery

If a hard disk fails, another hard disk is immediately allocated as the storage location.

time

The decisive factor for data security is how long it takes to restore the data and how high the probability is that other hard disks will fail during the recovery period.

Risk

Further failures extend the recovery period, and if more than 3 hard disks fail (more than the number of coding chunks in the 8 + 3 example), data loss occurs for the part of the data stored there.

SIZING

The recovery time depends on physical components such as the number of available hard disks, their fill level and the throughput.

CONFIG

The recovery behaviour is also significantly influenced by the correct choice of Ceph configuration parameters.
Care should always be taken to ensure that the parameters are in proportion to the physical hardware, for example (see the sketch after this list):

  • priority of the recovery in relation to the response to client requests during operation
  • optimal choice of PGs for the distribution of data
  • distinction between SLAs for read access and write access
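
To make this concrete, these are the kind of knobs meant here; the values are purely illustrative and must be chosen to match your hardware and SLAs:

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1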

Our general opinion on EC is:

Originally I didn't like it much and preferred to avoid it whenever possible, mainly because it's much more complicated (more bugs), much harder to restore ("partial" restore is not possible) and performance is usually worse. But "saved space" sounds too tempting at first glance. With that said, it is inevitable in the future, and there are actually cases where it is fine and can even work better than a replicated pool, e.g. when storing large data such as backup tarballs or videos, or when the writes are aligned to the stripe width (i.e. the application needs to know how to write effectively).

Sources

docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery