
47 posts tagged with "operation"


· One min read
Joachim Kraftmayer

get config (default: 4G)

ceph daemon mds.<mds-id> config get mds_cache_memory_limit
ceph daemon /var/run/ceph/<fsid>/<mds-id> config get mds_cache_memory_limit
ceph tell mds.storefs-a config show |grep mds_cache_memory_limit

set config on the fly, not persistent (to 64 GB)

ceph daemon mds.<mds-id> config set mds_cache_memory_limit 68719476736
ceph daemon /var/run/ceph/<fsid>/<mds-id> config set mds_cache_memory_limit 68719476736
ceph tell mds.storefs-a injectargs --mds_cache_memory_limit 68719476736

persist config (to 64 GB)

ceph config set mds mds_cache_memory_limit 68719476736
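
To verify that the persisted value is active, it can be read back from the configuration database (a short check that is not part of the original post):

ceph config get mds mds_cache_memory_limit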

· One min read
Joachim Kraftmayer

We have two options to get the gateway.conf:

gwcli

gwcli export mode=copy

or

rados

rados -p iscsi get gateway.conf /root/gateway.conf

At the moment there is no way to update or write the gateway.conf via the gwcli command, so the only option is to use the rados command line tool.
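
Writing a locally edited copy back works with rados put (a sketch; it assumes the pool and object name from the examples above):

rados -p iscsi put gateway.conf /root/gateway.conf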

note

Be careful when editing the content manually; it requires expertise.

sources

docs.ceph.com/en/latest/man/8/rados

manpages.ubuntu.com/manpages/jammy/man8/gwcli.8.html

· 2 min read
Joachim Kraftmayer

You might have wondered how to get rid of the warning "ceph warning - pools have many more objects per pg than average", because you want to see your cluster in HEALTH_OK status. The option to change the thresholds for this warning is: mon_pg_warn_max_object_skew

Especially when a Ceph cluster or a new pool goes into production, you can set the threshold high. After a certain time you should always check the value and adjust it if necessary.

An important note for this option: it must be set on the ceph mgr. You can often find posts that set the option on the ceph mon and then see no effect on the ceph status.

The cluster status commands 

ceph status

or

ceph health detail

show the following warning:

[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
    pool test objects per pg (2079) is more than 11.6798 times cluster average (178)

note

To disable the warning completely, the value of mon_pg_warn_max_object_skew must be set to 0 or a negative number.
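
For example, disabling it persistently would look like this (mirroring the persistent set command shown further below):

ceph config set mgr mon_pg_warn_max_object_skew 0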

Verify the default value

ceph config get mgr mon_pg_warn_max_object_skew
10.000000

inject the value:

ceph tell mgr.a injectargs '--mon_pg_warn_max_object_skew 50'

verify the value:

ceph tell mgr.a config get mon_pg_warn_max_object_skew
{
"mon_pg_warn_max_object_skew": "50.000000"
}

set the value persistently, e.g. to 50 times the average

ceph config set mgr mon_pg_warn_max_object_skew 50
ceph config get mgr mon_pg_warn_max_object_skew
50.000000

· 2 min read
Joachim Kraftmayer

ceph-volume can be used to create a new WAL/DB on a faster device for an existing OSD, without the need to recreate the OSD.

ceph-volume lvm new-db --osd-id 15 --osd-fsid FSID --target cephdb/cephdb1
--> NameError: name 'get_first_lv' is not defined
This is a bug in ceph-volume v16.2.7 that will be fixed in v16.2.8:
github.com/ceph/ceph/pull/44209

First, create a new logical volume on the device that will hold the new WAL/DB:

vgcreate cephdb /dev/sdb
Volume group "cephdb" successfully created
lvcreate -L 100G -n cephdb1 cephdb
Logical volume "cephdb1" created.

Now stop the running OSD and, if it was deactivated (cephadm), activate it on the host:

systemctl stop ceph-FSID@osd.0.service
ceph-volume lvm activate --all --no-systemd

Create the new WAL/DB on the new device:

ceph-volume lvm new-db --osd-id 0 --osd-fsid OSD-FSID --target cephdb/cephdb1
--> Making new volume at /dev/cephdb/cephdb1 for OSD: 0 (/var/lib/ceph/osd/ceph-0)
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-1
--> New volume attached.

Migrate the existing WAL/DB to the new device:

ceph-volume lvm migrate --osd-id 0 --osd-fsid OSD-FSID --from data --target cephdb/cephdb1
--> Migrate to existing, Source: ['--devs-source', '/var/lib/ceph/osd/ceph-0/block'] Target: /var/lib/ceph/osd/ceph-0/block.db
--> Migration successful.

Deactivate the OSD and start it again:

ceph-volume lvm deactivate 0
Running command: /bin/umount -v /var/lib/ceph/osd/ceph-0
stderr: umount: /var/lib/ceph/osd/ceph-0 unmounted
systemctl start ceph-FSID@osd.0.service
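
Once the OSD is up again, a quick sanity check can confirm that it now reports a dedicated DB device (not part of the original post; OSD id 0 as in the example, and the bluefs_db_* fields only appear when a separate DB device is attached):

ceph osd metadata 0 | grep bluefs_db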

· 2 min read
Joachim Kraftmayer

First we wanted to use ceph-bluestore-tool bluefs-bdev-new-wal. However, it turned out that it is not possible to ensure that the second DB is actually used. For this reason, we decided to migrate the entire bluefs of the osd to an ssd/flash.

bluestore

Verify the current osd bluestore setup

ceph-bluestore-tool show-label --dev <device> [...]

Verify the current size of the osd bluestore DB

ceph-bluestore-tool bluefs-bdev-sizes --path <osd path>

Migrate the bluefs data to the new device:

ceph-bluestore-tool bluefs-bdev-migrate --path <osd path> --dev-target <new-device> --devs-source <device1> [--devs-source <device2>]
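
A concrete invocation might look like this (a sketch only; the OSD path and the target LV mirror the ceph-volume example above and are assumptions, not values from this post):

ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-0 --devs-source /var/lib/ceph/osd/ceph-0/block --dev-target /dev/cephdb/cephdb1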

Verify the size of the osd bluestore DB after the migration

ceph-bluestore-tool bluefs-bdev-sizes --path <osd path>

If the size does not correspond to the new target size, execute the following command:

ceph-bluestore-tool bluefs-bdev-expand --path <osd path>

Instruct BlueFS to check the size of its block devices and, if they have expanded, make use of the additional space. Please note that only the new files created by BlueFS will be allocated on the preferred block device if it has enough free space, and the existing files that have spilled over to the slow device will be gradually removed when RocksDB performs compaction. In other words, if there is any data spilled over to the slow device, it will be moved to the fast device over time. docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/#commands

Verify the new osd bluestore setup

ceph-bluestore-tool show-label --dev <device> [...]

Update

You might be interested in a migration method on a higher layer with ceph-volume lvm.

docs.clyso.com/blog/ceph-volume-ceph-osd-migrate-db-to-larger-ssd-flash-device/

Appendix

I'm trying to figure out the appropriate process for adding a separate SSD block.db to an existing OSD. From what I gather the two steps are: 1. Use ceph-bluestore-tool bluefs-bdev-new-db to add the new db device 2. Migrate the data ceph-bluestore-tool bluefs-bdev-migrate I followed this and got both executed fine without any error. Yet when the OSD got started up, it keeps on using the integrated block.db instead of the new db. The block.db link to the new db device was deleted. Again, no error, just not using the new db www.spinics.net/lists/ceph-users/msg62357.html

Sources

docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool

tracker.ceph.com/attachments/download/4478/bluestore.png

www.suse.com/support/kb/doc/?id=000020276

· One min read
Joachim Kraftmayer

But, as I already mentioned (for a slightly different case), in newer versions there is ceph-volume lvm migrate [1], which I think allows doing the same in a much simpler way. I have not tried it yet and the documentation is not very clear to me, so one needs to experiment with this before writing exact instructions. We might also need to use the new-db [2] and new-wal [3] commands before running migrate, but I am not sure they are needed for this particular case.

[1] https://docs.ceph.com/en/latest/ceph-volume/lvm/migrate/

[2] https://docs.ceph.com/en/latest/ceph-volume/lvm/newdb/

[3] https://docs.ceph.com/en/latest/ceph-volume/lvm/newwal/
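
A minimal sketch of what such an experiment could look like, following the ceph-volume post above (the OSD id, OSD fsid and VG/LV names are placeholders, untested here):

ceph-volume lvm new-db --osd-id <osd-id> --osd-fsid <osd-fsid> --target <vg>/<lv>
ceph-volume lvm migrate --osd-id <osd-id> --osd-fsid <osd-fsid> --from data --target <vg>/<lv>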

· One min read
Joachim Kraftmayer

follow-up to osd(s) with unlimited ram growth

There is a way to check whether any OSDs in a Ceph cluster are affected by the "PG dup bug" by running the following command:

ceph tell osd.\* perf dump | grep 'osd_pglog\|^osd\.[0-9]'

This will provide you with a list of all OSDs in the cluster, containing two parameters:

  1. osd_pglog_bytes
  2. osd_pglog_items

"osd_pglog_items" counter is a sum of "normal" log entries, dup entries and some other things. Taking that osd_target_pg_log_entries_per_osd is 300.000 by default, we may assume that about 300.000 items are "normal" pg log entries, and if "osd_pglog_items" counter is much higher than this it is most likely due to dups. Example:

osd.32: { "osd_pglog_bytes": 1925908608, "osd_pglog_items": 17418324 }

osd_pglog_items = 17,418,324 - 300,000 = probably about 17 million PG dups

Running a manual check against this OSD with the commands from the post on OSD(s) with unlimited RAM growth revealed one PG with 17,090,093 entries. So this is a quick and easy way to identify problematic OSD(s) without the need to stop all OSDs and manually run commands.
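
For a quick per-OSD overview, the counters can also be collected and sorted with a small loop (a sketch, not from the original post; it assumes jq is installed and searches the perf dump output recursively so it does not depend on the exact section name):

for osd in $(ceph osd ls); do
  items=$(ceph tell osd.$osd perf dump | jq '[.. | .osd_pglog_items? // empty] | first')
  echo "osd.$osd $items"
done | sort -k2 -nr | head -n 20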

Sources:

github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L2951

· 4 min read
Joachim Kraftmayer

We have been working with Ceph for more than 10 years and have observed the following behaviour of osds on different Ceph clusters for several years:

  • OSDs start very slowly, taking up to 10 minutes until the OSD was reported up
  • increased OSD memory consumption of up to 8 GB, with the default osd_memory_target of 4 GB
  • individual hosts on which OSDs wanted to consume so much main memory that the OOM killer of the Linux kernel terminated the process; main memory consumption of individual OSDs was observed at up to 150 GB
  • complete Ceph clusters in which all OSDs could no longer be started successfully, because every OSD wanted to consume the maximum amount of main memory until it was terminated by the Linux OOM killer

There were also many messages on the Ceph mailing list about similar problems, which were analysed extensively. However, they never tracked down the root cause of the errors. Often the problem simply resolved itself, or the affected OSDs were removed and reinstalled.

Together with affected users and colleagues from the community, we were finally able to investigate the bug thoroughly and find the root cause:
tracker.ceph.com/issues/53729

Root Cause

There are dup entries with a version higher than the log entries. This means that if there is any dup entry with a higher version than the tail of the log, we will not trim anything past it, but we will keep accumulating new dups as we trim pg_log_entries and add them to the back of the dup list. tracker.ceph.com/issues/53729#note-57

Affected Ceph versions: all versions that have dups (jewel or luminous and later).

Possible explanation

A possible explanation for why we observe it more and more often is that the autoscaler was not active by default.

Our experience of the last few weeks is that we find OSDs with several million PG dup entries on many of the Ceph clusters we maintain. The current peak is an OSD with over 50 million entries (Octopus release).

Mitigation

We have developed some tools to mitigate the problem.

This is ceph-objectstore-tool built with the patches from PR github.com/ceph/ceph/pull/45529. See the built packages at shaman.ceph.com/repos/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/. For example, this is the build for bionic: 2.chacra.ceph.com/r/ceph/wip-mgolub-testing-pacific/2f62392e88f715976ed8eee2c86b0afd0f1d10ac/ubuntu/bionic/flavors/default/

Note: Shaman is used for testing on teuthology and I am not sure how long the packages remain available there.

Mitigation process

  1. Identify the affected OSD(s), e.g. by a long bootup after a restart

  2. Set noout, stop the OSD and mount it

    ceph osd set noout
    systemctl stop ceph-FSID@osd.OSD.service
    ceph-volume lvm activate OSD OSD-FSID --no-systemd
  3. Create a list of the PGs on the OSD

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op list-pgs > osd.11.pgs.txt
  4. Check for DUPs on all PGs

    while read pg; do echo $pg; ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11/ --op log --pgid $pg > pglog.json; jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' < pglog.json; done < /root/osd.11.pgs.txt 2>&1 | tee dups.log
  5. Run the tool on the affected PGs. Check the memory usage: it depends on the parameter osd_pg_log_trim_max; we observed around 4 GB with osd_pg_log_trim_max=500000, which we identified as the optimum value. Further increasing osd_pg_log_trim_max will not speed up the process.

    time ./ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op trim-pg-log --pgid PG --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 --osd_pg_log_trim_max=500000
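
After the trim has finished, the OSD can be brought back the same way it was taken down (a minimal sketch, assuming the cephadm-style service name from step 2 and that noout should be removed once the OSD is up again):

    ceph-volume lvm deactivate OSD
    systemctl start ceph-FSID@osd.OSD.service
    ceph osd unset noout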

Recommendation

In the example above, osd_pg_log_trim_max is already very high; increasing it further would not increase the speed. For safety reasons, we recommend starting with a smaller number, e.g. 100000. Based on the experience of the first run, you can then think about increasing the value to speed things up.

Notice

Special care should be taken with the following actions:

  • Ceph Cluster Upgrade
  • pg split or pg merge, manually or automatically via the autoscaler
  • use of the current patch

The listed actions can unintentionally trigger the trimming of the pg dups, which can make OSDs inaccessible for as long as they perform the trim action, during which they can consume an enormous amount of main memory.

This could have a particularly extreme effect on OSDs in connection with Rook and the Kubernetes liveness probe.

What will happen when a user that has a problem like us, i.e. 30 million of dups in a pg, but is not aware of it, upgrades to the version with the fixed pg_log trim, and it starts trimming? Am I right understanding that with the current implementation the trim will build the full set of 30 million dups and will try to remove it in one transaction? github.com/ceph/ceph/pull/45529

· One min read
Joachim Kraftmayer

The crash module collects information about daemon crashdumps and stores it in the Ceph cluster for later analysis.

If you see this message in the status of Ceph (ceph -s), you should first execute the following command to list all collected crashes:

ceph crash ls

In the output you can see which OSD(s) had or have problems, along with the respective time of occurrence.

You can get more information with the help of

ceph crash info <ID>

for the respective crash event.

If the crash is no longer relevant, it can be acknowledged with one of the following two commands:

ceph crash archive <ID>

or

ceph crash archive-all

After that the warning disappears from the ceph status output.

Sources

https://docs.ceph.com/en/quincy/mgr/crash/

· One min read
Joachim Kraftmayer

Create an erasure coded rbd pool

ceph osd pool create ec-pool 1024 1024 erasure 8-3
ceph osd pool set ec-pool allow_ec_overwrites true
rbd pool init ec-pool

note

Many things can be changed later in Ceph at runtime. However, the settings for the distribution of data and coding chunks must be defined when the EC pool is created. This means you should think carefully about what you plan to do with the pool in the future.
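
The examples above assume that an erasure code profile named 8-3 already exists. A sketch of how such a profile could be created (the k/m values and the failure domain are assumptions, not taken from the original post):

ceph osd erasure-code-profile set 8-3 k=8 m=3 crush-failure-domain=host
ceph osd erasure-code-profile get 8-3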

Create an erasure coded RBD image: the image data goes into the EC data pool, while for the metadata (OMAP objects) you need the replicated target pool:

rbd create --size 25G --data-pool ec-pool target-pool/new-image
rbd info target-pool/new-image

Sources

docs.ceph.com/en/latest/rados/operations/erasure-code/