24 posts tagged with "osd"

ceph-volume - ceph osd migrate DB to larger ssd/flash device

May 4, 2022 · One min read

Managing Director at Clyso

But, I already mentioned it (for a bit different case) in newer versions there is ceph-volume lvm migrate [1] which I think allows to do the same but in much simpler way. I have not tried it to and the documentation is not very clear to me so one need to experiment with this before writing exact instructions. We might also need to use new-db [2] and new-wal [3] commands before running migrate but I am not sure they are needed for this particular case.

[1] https://docs.ceph.com/en/latest/ceph-volume/lvm/migrate/

[2] https://docs.ceph.com/en/latest/ceph-volume/lvm/newdb/

[3] https://docs.ceph.com/en/latest/ceph-volume/lvm/newwal/

How To Identify OSD(s) affected by PG Dup Bug

April 25, 2022 · One min read

Joachim Kraftmayer

Managing Director at Clyso

follow-up to osd(s) with unlimited ram growth

There is a way to check a ceph cluster if there are any OSDs affected by the "PG Dup Bug" by running following command:

ceph tell osd.\* perf dump |grep 'osd_pglog\|^osd\.&#91;0-9]'

This will provide you a list of all OSDs in the cluster containing 2 Parameters:

osd_pglog_bytes
osd_pglog_items

"osd_pglog_items" counter is a sum of "normal" log entries, dup entries and some other things. Taking that osd_target_pg_log_entries_per_osd is 300.000 by default, we may assume that about 300.000 items are "normal" pg log entries, and if "osd_pglog_items" counter is much higher than this it is most likely due to dups. Example:

osd.32: { "osd_pglog_bytes": 1925908608, "osd_pglog_items": 17418324 }

osd_pglog_items = 17.418.324 - 300.000 = probably about 17 Million PG Dups

Running a manual check against this OSD with the commands in with unlimited ram growth revealed 1 PG with 17.090.093 entries. So this is a quick and easy way to identify problematic OSD(s) without the need to stop all OSDs and manually run commands.

Sources:

github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L2951

osd(s) with unlimited ram growth

April 12, 2022 · 4 min read

Joachim Kraftmayer

Managing Director at Clyso

We have been working with Ceph for more than 10 years and have observed the following behaviour of osds on different Ceph clusters for several years:

osds start very slowly, up to 10 minutes until ceph osd up was reported
increased osd memory consumption up to 8GB, with default osd_memory target of 4GB
individual hosts that wanted to consume so much main memory that the OOM killer of the Linux kernel terminated the process. Main memory consumption of individual osds tested up to 150 GB.
Complete Ceph cluster with all osds that could no longer be started successfully because all osds wanted to consume the maximum amount of main memory until they were terminated by the Linux OOM killer.

There were also many messages in the Ceph mailing list about similar problems, which were analysed extensively. However, they never tracked down the root cause of the errors. Often the problem simply resolved itself or the affected osds were removed and reinstalled.

Together with affected people and colleagues from the community, we were now able to permanently investigate the bug and found the root cause.
tracker.ceph.com/issues/53729

Root Cause

There are dup entries with a version higher than the log entries. This means that if there is any dup entry with a higher version than the tail of the log, we will not trim anything past it, but we will keep accumulating new dups as we trim pg_log_entries and add them to the back of the dup list. tracker.ceph.com/issues/53729#note-57

Affected Ceph versions: all versions that have dups (jewel or luminous and later).

Possible explanation

Possible explanation why we can observe it more and more often is that the autoscaler was not active by default.

Our experience of the last weeks is that we find osds with several million pg dup entries on many Ceph clusters we maintain. The current peak is osds with over 50 million entries (octopus release).

Mitigation

We have developed some tools to mitigate the problem.

This is ceph-objectstore-tool built with the patches from PR github.com/ceph/ceph/pull/45529 See built packages at shaman.ceph.com/repos/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ E.g. this is for bionic 2.chacra.ceph.com/r/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ubuntu/bionic/flavors/default/

Note: Shaman is used for testing on teuthology and I am not sure how long the packages remain available there.

Mitigation process

Identify OSD affected, e.g. long bootup after restart

set NOOUT, stop OSD and mount it

ceph osd set noout
systemctl stop ceph-FSID@osd.OSD.service
ceph-volume lvm activate OSD OSD-FSID --no-systemd

create list of PGs on OSD

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op list-pgs > osd.11.pgs.txt

Check for DUPs on all PGs

while read pg; do echo $pg; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11/ --op log --pgid $pg > pglog.json; jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pglog.json; done < /root/osd.11.pgs.txt 2>&1 | tee dups.log

run tool on affected PGs. Check Memory Usage - it depends on the parameter "osd_pg_log_trim_max" - we observed around 4G with osd_pg_log_trim_max=500000. we identified osd_pg_log_trim_max=500000 as the optimum value. further increasing osd_pg_log_trim_max will not speed up the process.
```
time ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op trim-pg-log --pgid PG --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=500000
```

Recommendation

In the example above, osd_pg_log_trim_max is already very high, increasing it further would not increase the speed. For safety reasons, we recommend starting with a smaller number, e.g. 100000. From the experience of the first run, you can then think about increasing or accelerating the speed.

Notice

Special care should be taken with the following actions:

Ceph Cluster Upgrade
pg split or pg merge- manually or automatically via autoscaler
use of the current patch

The listed actions can unintentionally trigger the trimming of the pg dups, which can lead to osds being inaccessible as long as they perform the trim action and can consume an enormous amount of main memory.

This could have a particularly extreme effect on osd in connection with rook and the kubernetes liveness probe.

What will happen when a user that has a problem like us, i.e. 30 million of dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction? github.com/ceph/ceph/pull/45529

Add OSD Nodes to a Ceph Cluster via Ansible

October 11, 2020 · One min read

Joachim Kraftmayer

Managing Director at Clyso

This guide will detail the process of adding OSD nodes to an existing cluster running RedHat Enterprise Storage 4 (Nautilus). The process can be completed without taking the cluster out of production.

Set ceph cluster into maintenance mode

ceph osd set norebalance

ceph osd set nobackfill 

ceph osd set norecover

Verify ceph cluster status

ceph status

Make sure that the new ceph node is defined in the /etc/hosts file.

vim /usr/share/ceph-ansible/hosts

[mons]
...
[mgrs]
...
[osds]
ceph-node1
ceph-node2
ceph-node3
ceph-node4
...

ping test before ansible playbook execution


ansible-playbook site-conatiner.yml --limit ceph-node4

unset maintenance mode

ceph osd unset nobackfill

ceph osd unset norecover

ceph osd unset norebalance

verify added Check that all Osds with hard drives have been added as expected

ceph osd tree
ceph osd crush tree
ceph osd df
ceph -s

verify all services uses the same version

ceph versions

sources

docs.ceph.com/projects/ceph-ansible/en/latest/day-2/osds.html

docs.ceph.com/projects/ceph-ansible/en/latest/

ceph auth error

August 31, 2020 · One min read

Joachim Kraftmayer

Managing Director at Clyso

monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]

We see this example again and again with customers who copy their keyring file directly from the output of:

 ceph auth ls

In the client.\<name\>.keyring the name is enclosed in square brackets and the key is separated by an equal sign and in the ceph auth ls by a colon.

ceph bug of the year 2020 - CERN

June 29, 2020 · One min read

Joachim Kraftmayer

Managing Director at Clyso

interesting insights into how dependency on external operating system libraries can affect the operation of Ceph.

https://www.youtube.com/watch?v=_4HUR00oCGo

https://codimd.web.cern.ch/p/rkNZH4JR8?print-pdf#/

speed up or slow down ceph recovery

June 12, 2020 · One min read

Joachim Kraftmayer

Managing Director at Clyso

osd max backfills: This is the maximum number of backfill operations allowed to/from OSD. The higher the number, the quicker the recovery, which might impact overall cluster performance until recovery finishes.
osd recovery max active: This is the maximum number of active recover requests. Higher the number, quicker the recovery, which might impact the overall cluster performance until recovery finishes.
osd recovery op priority: This is the priority set for recovery operation. Lower the number, higher the recovery priority. Higher recovery priority might cause performance degradation until recovery completes.

ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=2

Recommendation

Start in small steps, observe the Ceph status, client IOPs and throughput and then continue to increase in small steps.

In the producton with regard to the applications and hardware infrastructure, we recommend setting these settings back to default as soon as possible.

Sources

https://www.suse.com/support/kb/doc/?id=000019693

Bluestore Metadata Database sizing

May 6, 2019 · One min read

Joachim Kraftmayer

Managing Director at Clyso

RocksDB size targets are usually exponentially increasing:

300 MB, 3GB, 30GB, 300GB, ...

SST = Static Sorted Table

BlockBasedTable Format is rocksDB default SST.

github.com/facebook/rocksdb/wiki/Leveled-Compaction

github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format

verify ceph osd DB and WAL setup

March 14, 2019 · One min read

Joachim Kraftmayer

Managing Director at Clyso

When configuring osd in mixed setup with db and wal colocated on a flash device, ssd or NVMe. There were always changes and irritations where the DB and the WAL are really located. With a simple test it can be checked: The location of the DB for the respective OSD can be verified via ceph osd metadata osd.<id> and the variable "bluefs_dedicated_db": "1".

The WAL was created separately in earlier Ceph versions and automatically on the same device as the DB in later Ceph versions. The WAL can be easily tested by using the ceph osd.<id> tell bench command.

First you check larger write operations with the command:

ceph tell osd.0 bench 65536 409600

Second, you check with smaller objects that are smaller than the bluestore_prefer_deferred_size_hdd (64k).

ceph tell osd.0 bench 65536 4096

If you compare the IOPs of the two tests, one result should correspond to the IOPs of an SSD and the other result should be quite low for the HDD. From this you can know if the WAL is on the HDD or the flash device.

RocksDB - Leveled Compaction

February 24, 2019 · One min read

Joachim Kraftmayer

Managing Director at Clyso

Bluestore/RocksDB will only put the next level up size of DB on flash if the whole size will fit. These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are pointless. Only ~3GB of SSD will ever be used out of a 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB will be used.

How do I find the right SSD/NVMe partition size for hot DB.

https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

Root Cause​

Possible explanation​

Mitigation​

Mitigation process​

Recommendation​

Notice​

Set ceph cluster into maintenance mode​

Verify ceph cluster status​

unset maintenance mode​

verify added Check that all Osds with hard drives have been added as expected​

verify all services uses the same version​

sources​

Recommendation​

Sources​

Root Cause

Possible explanation

Mitigation

Mitigation process

Recommendation

Notice

Set ceph cluster into maintenance mode

Verify ceph cluster status

unset maintenance mode

verify added Check that all Osds with hard drives have been added as expected

verify all services uses the same version

sources

Recommendation

Sources