
50 posts tagged with "operation"


osd(s) with unlimited ram growth

· 4 min read
Joachim Kraftmayer
Managing Director at Clyso

We have been working with Ceph for more than 10 years and, over several years, have observed the following behaviour of osds on different Ceph clusters:

  • osds start very slowly, taking up to 10 minutes until ceph reported the osd as up
  • increased osd memory consumption of up to 8 GB, despite the default osd_memory_target of 4 GB
  • individual osds that wanted to consume so much main memory that the OOM killer of the Linux kernel terminated the process; we measured main memory consumption of individual osds of up to 150 GB
  • complete Ceph clusters in which no osd could be started successfully anymore, because every osd tried to consume the maximum amount of main memory until it was terminated by the Linux OOM killer

There were also many messages on the Ceph mailing list about similar problems, which were analysed extensively. However, the root cause of the errors was never tracked down. Often the problem simply resolved itself, or the affected osds were removed and reinstalled.

Together with affected users and colleagues from the community, we were finally able to investigate the bug in depth and found the root cause:
tracker.ceph.com/issues/53729

Root Cause

There are dup entries with a version higher than the log entries. This means that if there is any dup entry with a higher version than the tail of the log, we will not trim anything past it, but we will keep accumulating new dups as we trim pg_log_entries and add them to the back of the dup list. (tracker.ceph.com/issues/53729#note-57)

Affected Ceph versions: all versions that have dups (jewel or luminous and later).

Possible explanation

A possible explanation for why we can observe this more and more often is that the autoscaler, which triggers pg splits and merges, was previously not active by default.

Our experience of the last few weeks is that we find osds with several million pg dup entries on many of the Ceph clusters we maintain. The current peak is osds with over 50 million entries (Octopus release).

Mitigation

We have developed some tools to mitigate the problem.

This is ceph-objectstore-tool built with the patches from PR github.com/ceph/ceph/pull/45529. See the built packages at shaman.ceph.com/repos/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/. For example, this is the build for bionic: 2.chacra.ceph.com/r/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ubuntu/bionic/flavors/default/

Note: Shaman is used for testing on teuthology and I am not sure how long the packages remain available there.

Mitigation process

  1. Identify an affected OSD, e.g. by a long boot-up time after a restart

  2. Set noout, stop the OSD and activate it manually

    ceph osd set noout
    systemctl stop ceph-FSID@osd.OSD.service
    ceph-volume lvm activate OSD OSD-FSID --no-systemd
  3. create list of PGs on OSD

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op list-pgs > osd.11.pgs.txt
  4. Check for DUPs on all PGs

    while read pg; do
        echo $pg
        ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11/ --op log --pgid $pg > pglog.json
        jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pglog.json
    done < /root/osd.11.pgs.txt 2>&1 | tee dups.log
  5. Run the tool on the affected PGs. Check the memory usage: it depends on the parameter osd_pg_log_trim_max. We observed around 4 GB with osd_pg_log_trim_max=500000 and identified this as the optimum value; increasing osd_pg_log_trim_max further will not speed up the process.

    time ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op trim-pg-log --pgid PG --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=500000
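Once the trim has finished, the OSD can be started again and the noout flag removed, mirroring step 2 (service name as above):

    systemctl start ceph-FSID@osd.OSD.service
    ceph osd unset noout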

Recommendation

In the example above, osd_pg_log_trim_max is already very high; increasing it further would not increase the speed. For safety reasons, we recommend starting with a smaller number, e.g. 100000. Based on the experience of the first run, you can then decide whether to increase it.

Notice

Special care should be taken with the following actions:

  • Ceph Cluster Upgrade
  • pg split or pg merge - manually or automatically via the autoscaler
  • use of the current patch

The listed actions can unintentionally trigger the trimming of the pg dups, which can make osds inaccessible for as long as they perform the trim action, and can consume an enormous amount of main memory.

This can have a particularly extreme effect on osds in connection with Rook and the Kubernetes liveness probe.

From the discussion in github.com/ceph/ceph/pull/45529: "What will happen when a user that has a problem like us, i.e. 30 million of dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction?"

Ceph message "daemons have recently crashed"

· One min read
Joachim Kraftmayer
Managing Director at Clyso

The crash module collects information about daemon crashdumps and stores it in the Ceph cluster for later analysis.

If you see this message in the status of Ceph (ceph -s), you should first execute the following command to list all collected crashes:

ceph crash ls

The output shows which OSD(s) had or have problems, together with the respective time of occurrence.

You can get more information with the help of

ceph crash info <ID>

for the respective crash event.

If a crash is no longer relevant, it can be acknowledged with one of the following two commands:

ceph crash archive <ID>

or

ceph crash archive-all

After that the warning disappears from the ceph status output.

Sources

https://docs.ceph.com/en/quincy/mgr/crash/

Howto create an erasure coded rbd pool

· One min read
Joachim Kraftmayer
Managing Director at Clyso

Create an erasure coded rbd pool

ceph osd pool create ec-pool 1024 1024 erasure 8-3
ceph osd pool set ec-pool allow_ec_overwrites true
rbd pool init ec-pool
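The create command above assumes an erasure code profile named 8-3 already exists. A minimal sketch of defining such a profile, assuming k=8, m=3 and host as the failure domain:

ceph osd erasure-code-profile set 8-3 k=8 m=3 crush-failure-domain=host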

note

Many things in Ceph can be changed later at runtime. However, the settings for the distribution of data and coding chunks must be defined when the EC pool is created. This means you should think carefully about what you plan to do with the pool in the future.

Create an erasure coded rbd image: the image data goes into the EC data pool, while the metadata (OMAP objects) requires a replicated target pool:

rbd create --size 25G --data-pool ec-pool target-pool/new-image
rbd info target-pool/new-image

Sources

docs.ceph.com/en/latest/rados/operations/erasure-code/

Install Ceph ISCSI Gateways under Debian with ceph-iscsi

· 3 min read
Joachim Kraftmayer
Managing Director at Clyso

Preliminary remark: Perhaps some people still know the ceph-iscsi project under the name ceph-iscsi-cli.

Installation of necessary Debian packages

apt install ca-certificates
apt install librbd1 libkmod2 python-pyparsing python-kmodpy python-pyudev python-gobject python-urwid python-rados python-rbd python-netifaces python-crypto python-requests python-flask python-openssl python-rpm ceph-common

Ceph setup with pool and user

ceph-iscsi handles the administration of iSCSI devices and their mapping to rbd images. For this we need a separate ceph pool and a separate user. Contrary to the standard documentation, I do not use client.admin but create a restricted user client.iscsi.

Pool

The standard pool has the name rbd, here we give it the name iscsi.

ceph osd pool create <pool-name> 2048 2048 replicated <rule-name>
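For example, with the pool name iscsi and Ceph's default replicated rule (rule name assumed to be the default replicated_rule):

ceph osd pool create iscsi 2048 2048 replicated replicated_rule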

User

The user iscsi is created with the necessary rbd authorizations on the pool iscsi:

ceph auth add client.iscsi mon 'profile rbd' osd 'profile rbd pool=iscsi'

Installation of necessary Debian packages for ceph-iscsi

apt install tcmu-runner targetcli-fb python-rtslib-fb

Manual installation of ceph-iscsi

apt install git
git clone https://github.com/ceph/ceph-iscsi.git
apt install python-setuptools python-configshell-fb
apt install librbd1 libkmod2 python-pyparsing python-kmodpy python-pyudev python-gobject python-urwid python-rados python-rbd python-netifaces python-crypto python-requests python-flask python-openssl python-rpm ceph-common
cd ceph-iscsi
python setup.py install --install-scripts=/usr/bin
cp usr/lib/systemd/system/rbd-target-gw.service /lib/systemd/system
cp usr/lib/systemd/system/rbd-target-api.service /lib/systemd/system
systemctl daemon-reload
systemctl enable rbd-target-gw
systemctl start rbd-target-gw
systemctl enable rbd-target-api
systemctl start rbd-target-api

ISCSI configuration

These settings belong in the gateway configuration file, /etc/ceph/iscsi-gateway.cfg, on each gateway node:

[config]
# Name of the Ceph cluster. A suitable <cluster_name>.conf file allowing
# access to the ceph cluster from the gateway node is required.
cluster_name = ceph

# Pool name where the internal gateway.conf object is stored
# pool = rbd
pool = rbd

# CephX client name, e.g. client.admin
# cluster_client_name = client.<name>
cluster_client_name = client.iscsi

# API settings.
# The api supports a number of options that allow you to tailor it to your
# local environment. If you want to run the api under https, you will need
# to create crt/key files that are compatible for each gateway node (i.e.
# not locked to a specific node). SSL crt and key files must be called
# iscsi-gateway.crt and iscsi-gateway.key and placed in /etc/ceph on each
# gateway node. With the SSL files in place, you can use api_secure = true
# to switch to https mode.
# To support the api, the bare minimum settings are:
api_secure = false

# Additional API configuration options are as follows (defaults shown):
# api_user = admin
# api_password = admin
# api_port = 5000
# trusted_ip_list = IP,IP
trusted_ip_list = 10.27.252.176, 127.0.0.1

# Refer to the ceph-iscsi-config/settings module for more options
logger_level = DEBUG
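Once rbd-target-api is running on the gateway, you can check the configuration tree with gwcli, the CLI shipped with ceph-iscsi:

gwcli ls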

Sources

github.com/ceph/ceph-iscsi
https://docs.ceph.com/docs/master/rbd/iscsi-initiator-esx/
https://docs.ceph.com/docs/master/rbd/iscsi-target-cli-manual-install/
https://docs.ceph.com/docs/luminous/rbd/iscsi-target-cli/

how to recover an accidentally deleted client.admin keyring

· One min read
Joachim Kraftmayer
Managing Director at Clyso

Log in to one ceph monitor node and create a new recovery client:

You can do this with client.admin, but I prefer to create a separate recovery client.

cephadm docker host:

ceph -n mon. --keyring /var/lib/ceph/<fsid>/mon/<mon-name>/keyring get-or-create client.recovery mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *'

ceph standard host:

ceph -n mon. --keyring /var/lib/ceph/mon/<mon-name>/keyring get-or-create client.recovery mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *'

install ceph-common:

apt install ceph-common

create two files:

/etc/ceph/ceph.conf

[global]
fsid=<you find the ceph_fsid file in each path of osd, mon or mgr>
mon_host = [v2:<ip addr of the active ceph monitor>:3300/0,v1:<ip addr of the active ceph monitor>:6789/0]

/etc/ceph/ceph.client.recovery.keyring (add the output of the ceph get-or-create command; replace the : with = and put the client name in [])
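For illustration, the resulting keyring file should look like this, with the key taken from the get-or-create output:

[client.recovery]
    key = <key from the get-or-create output>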

Add OSD Nodes to a Ceph Cluster via Ansible

· One min read
Joachim Kraftmayer
Managing Director at Clyso

This guide will detail the process of adding OSD nodes to an existing cluster running Red Hat Ceph Storage 4 (Nautilus). The process can be completed without taking the cluster out of production.

Set the ceph cluster into maintenance mode

ceph osd set norebalance

ceph osd set nobackfill

ceph osd set norecover

Verify ceph cluster status

ceph status

Make sure that the new ceph node is defined in the /etc/hosts file, then add it to the Ansible inventory:

vim /usr/share/ceph-ansible/hosts
[mons]
...
[mgrs]
...
[osds]
ceph-node1
ceph-node2
ceph-node3
ceph-node4
...

ping test before ansible playbook execution
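A quick connectivity check with Ansible's ping module could look like this (inventory path assumed from above):

ansible -i /usr/share/ceph-ansible/hosts ceph-node4 -m ping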


Then run the playbook limited to the new node:

ansible-playbook site-container.yml --limit ceph-node4

unset maintenance mode

ceph osd unset nobackfill

ceph osd unset norecover

ceph osd unset norebalance

Verify that all OSDs and their hard drives have been added as expected:

ceph osd tree
ceph osd crush tree
ceph osd df
ceph -s

verify that all services use the same version

ceph versions

sources

docs.ceph.com/projects/ceph-ansible/en/latest/day-2/osds.html

docs.ceph.com/projects/ceph-ansible/en/latest/

ceph get erasure coding pool profile

· 2 min read
Joachim Kraftmayer
Managing Director at Clyso

Perhaps someone has already thought about using EC (erasure coding) for ceph pools so that the overhead for storing data durably is not too high. This has been a topic in many of the trainings we have held in recent years.

But what most people forget after creating EC pools is how to get all the information about an existing pool.

ceph osd pool ls

or

ceph osd pool ls detail

don't really give information about the configuration of erasure coding pools. However, there is a small option that lets ceph spill the beans a bit more.

ceph osd pool ls detail --format=json

you might get more information than you want.

But with

ceph osd pool ls detail --format=json | jq '.'

the whole thing looks much more friendly to the eyes.

And here we find more information about the erasure coded pools:

ceph osd pool ls detail --format=json | jq '.' | grep erasure_code_profile
"erasure_code_profile": "clyso-costum-profile",
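If you prefer a targeted view over grep, a small jq filter does the job; the field names pool_name and erasure_code_profile are assumptions based on the JSON output of recent releases:

ceph osd pool ls detail --format=json | jq '.[] | {pool_name, erasure_code_profile}'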

If you want to list all defined profiles, then use

ceph osd erasure-code-profile ls

You can get detailed information about an erasure code profile with:

ceph osd erasure-code-profile get clyso-costum-profile

ceph prior to 14.2.12 - profile rbd does not allow the use of RBD_INFO

· One min read
Joachim Kraftmayer
Managing Director at Clyso

We had the problem of getting the correct authorizations for the Ceph CSI user on the pools.

We then found the following bug, which affects versions prior to 14.2.12:

https://github.com/ceph/ceph/pull/36413/files#diff-1ad4853f970880c78ea0e52c81e621b4

It was then solved with version 14.2.12:

https://tracker.ceph.com/issues/46321