Clyso Blog | Clyso GmbH

ceph osd migrate DB to larger ssd/flash device

May 4, 2022 · 2 min read

Managing Director at Clyso

First we wanted to use ceph-bluestore-tool bluefs-bdev-new-wal. However, it turned out that it is not possible to ensure that the second DB is actually used. For this reason, we decided to migrate the entire bluefs of the osd to an ssd/flash.

Verify the current osd bluestore setup

ceph-bluestore-tool show-label –dev device […]

Verify the current size of the osd bluestore DB

ceph-bluestore-tool  bluefs-bdev-sizes –path <osd path>

ceph-bluestore-tool bluefs-bdev-migrate –path osd path –dev-target new-device –devs-source device1 [–devs-source device2]

Verify the size of the osd bluestore DB after the migration

ceph-bluestore-tool  bluefs-bdev-sizes –path <osd path>

if the size does not correspond to the new target size execute the following command:

ceph-bluestore-tool bluefs-bdev-expand –path osd path

Instruct BlueFS to check the size of its block devices and, if they have expanded, make use of the additional space. Please note >that only the new files created by BlueFS will be allocated on the preferred block device if it has enough free space, and the >existing files that have spilled over to the slow device will be gradually removed when RocksDB performs compaction. In other >words, if there is any data spilled over to the slow device, it will be moved to the fast device over time. https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/#commands

Verify the new osd bluestore setup

ceph-bluestore-tool show-label –dev device […]

Update

You might be interested in a migration method on a higher layer with ceph-volume lvm.

docs.clyso.com/blog/ceph-volume-ceph-osd-migrate-db-to-larger-ssd-flash-device/

Appendix

I'm trying to figure out the appropriate process for adding a separate SSD block.db to an existing OSD. From what I gather the two steps are: 1. Use ceph-bluestore-tool bluefs-bdev-new-db to add the new db device 2. Migrate the data ceph-bluestore-tool bluefs-bdev-migrate I followed this and got both executed fine without any error. Yet when the OSD got started up, it keeps on using the integrated block.db instead of the new db. The block.db link to the new db device was deleted. Again, no error, just not using the new db www.spinics.net/lists/ceph-users/msg62357.html

Sources

docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool

tracker.ceph.com/attachments/download/4478/bluestore.png

www.suse.com/support/kb/doc/?id=000020276

ceph-volume - ceph osd migrate DB to larger ssd/flash device

May 4, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

But, I already mentioned it (for a bit different case) in newer versions there is ceph-volume lvm migrate [1] which I think allows to do the same but in much simpler way. I have not tried it to and the documentation is not very clear to me so one need to experiment with this before writing exact instructions. We might also need to use new-db [2] and new-wal [3] commands before running migrate but I am not sure they are needed for this particular case.

[1] https://docs.ceph.com/en/latest/ceph-volume/lvm/migrate/

[2] https://docs.ceph.com/en/latest/ceph-volume/lvm/newdb/

[3] https://docs.ceph.com/en/latest/ceph-volume/lvm/newwal/

OpenInfra 2022 Berlin

May 4, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

Wed, June 8, 2:50pm - 3:20pm | Berlin Congress Center - B - B09

Ceph on WindowsPrivate & Hybrid Cloud

Ceph RADOS, RBD and CephFS have been ported on Microsoft Windows, a community effort led by SUSE and Cloudbase Solutions. The goal consisted in porting librados and librdb on Windows Server, providing a kernel driver for exposing RBD devices natively as Windows volumes, support for Hyper-V VMs and last but not least, even CephFS. During this session we will talk about the architectural differences between Windows and Linux from a storage standpoint and how we retained the same CLI so that long time Ceph users will feel at home regardless of the underlying operating system. Performance is a key aspect of this porting, with Ceph on Windows significantly outperforming the iSCSI gateway, previously the main option for accessing RBD images from Windows nodes. There will be no lack of live demos, including automating the installation of the Windows binaries, setting up and managing a Ceph cluster across Windows and Linux nodes, spinning up Hyper-V VMs from RBD, and CephFS.

openinfra.dev/summit-schedule

Erasure Coding and the data chunk size

April 26, 2022 · 3 min read

Joachim Kraftmayer

Managing Director at Clyso

Follow the recommendation

We have seen many different EC profiles over the last 10+ years, but few that follow the official ceph.io recommendations.

We generally recommend min_size be K+2 or more to prevent loss of writes and data.
docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery

Erasure Coding vs RAID in production

Erasure coding and RAID (e.g. RAID5, RAID6,…) are often compared to data chunks and coding chunks because of the architecture.

However, they differ considerably from each other in productive use.

global rule vs data set rule

In software or hardware RAID, the number of hard disks, including hotspare, for storing all data is fixed.
In Ceph, however, e.g. with an EC profile of 8 + 3 and Failure Domain HOST, a total of 11 servers with one hard disk each are involved in storing a data set.
For the next data set, other servers or other hard disks are used.

Key facts for ceph recovery

If a hard disk fails, another hard disk is immediately allocated as the storage location.

time

The decisive factor for data security is how long it takes to restore the data and how high the probability is that other hard disks will fail during the recovery period.

Risk

Further failures extend the recovery period or if more than 3 hard disks fail, data loss occurs for the part of the data stored there.

SIZING

The recovery time depends on physical components such as the number of available hard disks, their fill level and the throughput.

CONFIG

The recovery behaviour is also significantly influenced by the correct choice of Ceph configuration parameters.
Care should always be taken to ensure that the parameters are in relation to the physical hardware, e.g. priority of the recovery in relation to the response to client requests during operation:

priority of the recovery in relation to the response to client requests during operation.
optimal choice of PGs for the distribution of data
distinction between SLA for read access and write access
…

Our general opinion on EC is

Originally I didn't like it much and would like to avoid it whenever possible. Mainly because it's much more complicated (more bugs), much harder to restore ("partial" restore is not possible) and performance is usually worse. But "saved space" sounds too tempting at first glance. With that said, it is inevitable in the future and there are actually cases where it is fine and can even work better than a replicated pool, e.g. when storing large data such as backup tarballs or videos, or when the writes are aligned to the stripe width (i.e. the application needs to know how to write effectively).

Sources

docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery

How To Identify OSD(s) affected by PG Dup Bug

April 25, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

follow-up to osd(s) with unlimited ram growth

There is a way to check a ceph cluster if there are any OSDs affected by the "PG Dup Bug" by running following command:

ceph tell osd.\* perf dump |grep 'osd_pglog\|^osd\.&#91;0-9]'

This will provide you a list of all OSDs in the cluster containing 2 Parameters:

osd_pglog_bytes
osd_pglog_items

"osd_pglog_items" counter is a sum of "normal" log entries, dup entries and some other things. Taking that osd_target_pg_log_entries_per_osd is 300.000 by default, we may assume that about 300.000 items are "normal" pg log entries, and if "osd_pglog_items" counter is much higher than this it is most likely due to dups. Example:

osd.32: { "osd_pglog_bytes": 1925908608, "osd_pglog_items": 17418324 }

osd_pglog_items = 17.418.324 - 300.000 = probably about 17 Million PG Dups

Running a manual check against this OSD with the commands in with unlimited ram growth revealed 1 PG with 17.090.093 entries. So this is a quick and easy way to identify problematic OSD(s) without the need to stop all OSDs and manually run commands.

Sources:

github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L2951

kubernetes worker node with hugepages

April 21, 2022 · 2 min read

Joachim Kraftmayer

Managing Director at Clyso

What are hugepages?

For example, x86 CPUs normally support 4K and 2M (1G if architecturally supported) page sizes, ia64 architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M, 256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical translations. Typically this is a very scarce resource on processor. Operating systems try to make best use of limited number of TLB resources. This optimization is more critical now as bigger and bigger physical memories (several GBs) are more readily available. https://www.kernel.org/doc/> Documentation/vm/hugetlbpage.txt

How to configure huge pages

clyso@compute-21:~$ grep Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
clyso@compute-21:~$

echo 1024 > /proc/sys/vm/nr_hugepages

echo "vm.nr_hugepages=1024" &gt; /etc/sysctl.d/hugepages.conf

total huge pages

clyso@compute-21:/etc/sysctl.d# grep HugePages_Total /proc/meminfo
HugePages_Total: 1024
clyso@compute-21:/etc/sysctl.d#

free hugepages

clyso@compute-21:/etc/sysctl.d# grep HugePages_Free /proc/meminfo
HugePages_Free: 1024
clyso@compute-21:/etc/sysctl.d#

free memory

clyso@compute-21:/etc/sysctl.d# grep MemFree /proc/meminfo
MemFree: 765177380 kB
clyso@compute-21:/etc/sysctl.d#

How to make huge pages available in kubernetes?

restart kubernetes kublet on worker node

sudo systemctl restart kubelet.service

verify in kubernetes

Allocated resources

clyso@compute-21:~$ kubectl describe node compute-21 | grep -A 8 "Allocated resources:"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                4950m (10%)   15550m (32%)
  memory             27986Mi (3%)  292670Mi (37%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      400Mi (19%)   400Mi (19%)
clyso@compute-21:~$

Capacity

clyso@compute-21:~$ kubectl describe node compute-21 | grep -A 13 "Capacity:"
Capacity:
  cpu:                48
  ephemeral-storage:  1536640244Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      2Gi
  memory:             792289900Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  1416167646526
  hugepages-1Gi:      0
  hugepages-2Mi:      2Gi
  memory:             790090348Ki
  pods:               110
clyso@compute-21:~$

Sources:

Manage HugePages
Brief summary of hugetlbpage support in the Linux kernel
Configuring Huge Pages in Red Hat Enterprise Linux 4 or 5

Ceph Quincy - April 19, 2022

April 20, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

Quincy is the 17th stable release of Ceph. It is named after Squidward Quincy Tentacles fromSpongebob Squarepants.

Sources

ceph.io/en/news/blog/2022/v17-2-0-quincy-released/

osd(s) with unlimited ram growth

April 12, 2022 · 4 min read

Joachim Kraftmayer

Managing Director at Clyso

We have been working with Ceph for more than 10 years and have observed the following behaviour of osds on different Ceph clusters for several years:

osds start very slowly, up to 10 minutes until ceph osd up was reported
increased osd memory consumption up to 8GB, with default osd_memory target of 4GB
individual hosts that wanted to consume so much main memory that the OOM killer of the Linux kernel terminated the process. Main memory consumption of individual osds tested up to 150 GB.
Complete Ceph cluster with all osds that could no longer be started successfully because all osds wanted to consume the maximum amount of main memory until they were terminated by the Linux OOM killer.

There were also many messages in the Ceph mailing list about similar problems, which were analysed extensively. However, they never tracked down the root cause of the errors. Often the problem simply resolved itself or the affected osds were removed and reinstalled.

Together with affected people and colleagues from the community, we were now able to permanently investigate the bug and found the root cause.
tracker.ceph.com/issues/53729

Root Cause

There are dup entries with a version higher than the log entries. This means that if there is any dup entry with a higher version than the tail of the log, we will not trim anything past it, but we will keep accumulating new dups as we trim pg_log_entries and add them to the back of the dup list. tracker.ceph.com/issues/53729#note-57

Affected Ceph versions: all versions that have dups (jewel or luminous and later).

Possible explanation

Possible explanation why we can observe it more and more often is that the autoscaler was not active by default.

Our experience of the last weeks is that we find osds with several million pg dup entries on many Ceph clusters we maintain. The current peak is osds with over 50 million entries (octopus release).

Mitigation

We have developed some tools to mitigate the problem.

This is ceph-objectstore-tool built with the patches from PR github.com/ceph/ceph/pull/45529 See built packages at shaman.ceph.com/repos/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ E.g. this is for bionic 2.chacra.ceph.com/r/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ubuntu/bionic/flavors/default/

Note: Shaman is used for testing on teuthology and I am not sure how long the packages remain available there.

Mitigation process

Identify OSD affected, e.g. long bootup after restart

set NOOUT, stop OSD and mount it

ceph osd set noout
systemctl stop ceph-FSID@osd.OSD.service
ceph-volume lvm activate OSD OSD-FSID --no-systemd

create list of PGs on OSD

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op list-pgs > osd.11.pgs.txt

Check for DUPs on all PGs

while read pg; do echo $pg; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11/ --op log --pgid $pg > pglog.json; jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pglog.json; done < /root/osd.11.pgs.txt 2>&1 | tee dups.log

run tool on affected PGs. Check Memory Usage - it depends on the parameter "osd_pg_log_trim_max" - we observed around 4G with osd_pg_log_trim_max=500000. we identified osd_pg_log_trim_max=500000 as the optimum value. further increasing osd_pg_log_trim_max will not speed up the process.
```
time ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op trim-pg-log --pgid PG --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=500000
```

Recommendation

In the example above, osd_pg_log_trim_max is already very high, increasing it further would not increase the speed. For safety reasons, we recommend starting with a smaller number, e.g. 100000. From the experience of the first run, you can then think about increasing or accelerating the speed.

Notice

Special care should be taken with the following actions:

Ceph Cluster Upgrade
pg split or pg merge- manually or automatically via autoscaler
use of the current patch

The listed actions can unintentionally trigger the trimming of the pg dups, which can lead to osds being inaccessible as long as they perform the trim action and can consume an enormous amount of main memory.

This could have a particularly extreme effect on osd in connection with rook and the kubernetes liveness probe.

What will happen when a user that has a problem like us, i.e. 30 million of dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction? github.com/ceph/ceph/pull/45529

Cephalocon 2022 in Portland postponed

February 10, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

ceph.io/en/news/blog/2022/cephalocon-postponed

CentOS 8 end of life

January 10, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

CentOS Linux 8 will reach End Of Life (EOL) on December 31st, 2021. Here’s what that means.

CentOS Linux EOL

Update​

Appendix​

Sources​

Ceph on WindowsPrivate & Hybrid Cloud​

Follow the recommendation​

Erasure Coding vs RAID in production​

global rule vs data set rule​

Key facts for ceph recovery​

time​

Risk​

SIZING​

CONFIG​

Our general opinion on EC is​

Sources​

What are hugepages?​

How to configure huge pages​

How to make huge pages available in kubernetes?​

restart kubernetes kublet on worker node​

verify in kubernetes​

Allocated resources​

Capacity​

Sources​

Root Cause​

Possible explanation​

Mitigation​

Mitigation process​

Recommendation​

Notice​

Update

Appendix

Sources

Ceph on WindowsPrivate & Hybrid Cloud

Follow the recommendation

Erasure Coding vs RAID in production

global rule vs data set rule

Key facts for ceph recovery

time

Risk

SIZING

CONFIG

Our general opinion on EC is

Sources

What are hugepages?

How to configure huge pages

How to make huge pages available in kubernetes?

restart kubernetes kublet on worker node

verify in kubernetes

Allocated resources

Capacity

Sources

Root Cause

Possible explanation

Mitigation

Mitigation process

Recommendation

Notice