
100 posts tagged with "ceph"


· One min read
Joachim Kraftmayer

When configuring OSDs in a mixed setup with the DB and WAL colocated on a flash device (SSD or NVMe), there has repeatedly been confusion about where the DB and the WAL actually end up. This can be checked with a simple test: the location of the DB for a given OSD can be verified via ceph osd metadata osd.<id> and the variable "bluefs_dedicated_db": "1".
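For example, a quick check on the command line (the metadata field names can differ slightly between Ceph releases):

# show whether osd.0 has a dedicated DB and/or WAL device
ceph osd metadata osd.0 | grep -E '"bluefs_dedicated_(db|wal)"'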

In earlier Ceph versions the WAL was created separately; in later versions it is automatically placed on the same device as the DB. The WAL placement can easily be tested with the ceph tell osd.<id> bench command.

First, check larger write operations with the command:

ceph tell osd.0 bench 65536 409600

Second, check with smaller writes, below bluestore_prefer_deferred_size_hdd (64k):

ceph tell osd.0 bench 65536 4096

If you compare the IOPS of the two tests, the small writes are deferred to the WAL and should therefore show roughly the IOPS of the flash device, while the larger writes go directly to the data device and should be considerably lower on an HDD. From this you can tell whether the WAL is on the HDD or on the flash device.
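In recent Ceph releases the bench output is JSON, so the two results can be compared directly, for example with jq (a sketch; older releases print plain text instead):

# IOPS of the large-write and the small-write test for osd.0
ceph tell osd.0 bench 65536 409600 | jq .iops
ceph tell osd.0 bench 65536 4096 | jq .iops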

· One min read
Joachim Kraftmayer

BlueStore/RocksDB will only place the next level of the DB on flash if the whole level fits. These sizes are roughly 3 GB, 30 GB and 300 GB; anything in between is pointless. Out of a 28 GB partition only ~3 GB of SSD will ever be used, and a 240 GB partition is likewise pointless because only ~30 GB will be used.

How do I find the right SSD/NVMe partition size for the hot DB?
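The ~3/30/300 GB figures can be roughly reproduced from the RocksDB level sizes, assuming the Ceph defaults max_bytes_for_level_base = 256 MB and max_bytes_for_level_multiplier = 10 (a sketch that ignores WAL and compaction overhead):

# cumulative space needed to hold RocksDB levels L1..Ln entirely on flash
awk 'BEGIN {
  base = 256; mult = 10; sum = 0
  for (n = 1; n <= 4; n++) {
    level = base * mult^(n-1); sum += level
    printf "L%d = %6d MB, cumulative %6d MB (~%.0f GB)\n", n, level, sum, sum/1024
  }
}'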

https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

· One min read
Joachim Kraftmayer

In his presentation at the OpenStack Summit 2018 in Berlin, Sage Weil presents the roadmap for the upcoming features in Nautilus and the following releases, with a strong focus on the trend towards multi- and hybrid-cloud environments and on scaling Ceph clusters across data center boundaries.

www.slideshare.net/sageweil1/ceph-data-services-in-a-multi-and-hybrid-cloud-world

· One min read
Joachim Kraftmayer

When commissioning a cluster, it is always advisable to log and evaluate the ceph osd bench results.

The values can also be helpful for performance analysis in a productive Ceph cluster.

ceph tell osd.<int|*> bench {<int>} {<int>} {<int>}

OSD benchmark: write <count> <size>-byte objects (default count 1G, size 4 MB)

osd_bench_max_block_size=65536 kB

Example:

1G size 4MB (default)

ceph tell osd.* bench

1G size 64MB

ceph tell osd.* bench 1073741824 67108864
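A minimal sketch for logging the results per OSD so they can be compared later (the loop and the log file name are only an example):

# run the default bench on every OSD and keep the output
for id in $(ceph osd ls); do
  echo "== osd.$id =="
  ceph tell osd.$id bench
done | tee osd-bench-$(date +%Y%m%d).log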

· 2 min read
Joachim Kraftmayer

When creating an RBD image, you can pass the stripe unit and the stripe count.
A smaller stripe unit means that small write operations are distributed better across the OSDs of the Ceph cluster.

rbd -p benchpool create image-su-64kb --size 102400 --stripe-unit 65536 --stripe-count 16
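The striping parameters of the image can be verified afterwards with rbd info; for a striped format-2 image the output should show the stripe unit and stripe count:

rbd -p benchpool info image-su-64kb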

RBD images are striped over many objects, which are then stored by the Ceph distributed object store (RADOS). As a result, read and write requests for the image are distributed across many nodes in the cluster, generally preventing any single node from becoming a bottleneck when individual images get large or busy.

The striping is controlled by three parameters:

order The size of objects we stripe over is a power of two, specifically 2^[order] bytes. The default is 22, or 4 MB.

stripe_unit Each [stripe_unit] contiguous bytes are stored adjacently in the same object, before we move on to the next object.

stripe_count After we write [stripe_unit] bytes to [stripe_count] objects, we loop back to the initial object and write another stripe, until the object reaches its maximum size (as specified by [order]). At that point, we move on to the next [stripe_count] objects. By default, [stripe_unit] is the same as the object size and [stripe_count] is 1. Specifying a different [stripe_unit] requires that the STRIPINGV2 feature be supported (added in Ceph v0.53) and format 2 images be used.

docs.ceph.com/docs/giant/man/8/rbd/#striping

· One min read
Joachim Kraftmayer
Distribution of the last-scrub timestamps of all active PGs by weekday (the awk column positions depend on the Ceph version):

for date in `ceph pg dump | grep active | awk '{print $20}'`; do date +%A -d $date; done | sort | uniq -c

19088 Monday
1752 Saturday
54296 Sunday
The same distribution by hour of day:

for date in `ceph pg dump | grep active | awk '{print $21}'`; do date +%H -d $date; done | sort | uniq -c

dumped all
3399 00
3607 01
2449 02
2602 03
6145 04
4907 05
4986 06
3777 07
2421 08
2429 09
2478 10
2546 11
2523 12
2614 13
2661 14
2722 15
2669 16
2649 17
2656 18
2751 19
2780 20
2893 21
3157 22
3315 23
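Since the awk column positions change between Ceph versions, a more robust variant is to use the JSON output (a sketch; the JSON path .pg_map.pg_stats applies to newer releases, older ones expose .pg_stats directly):

# weekday distribution of the last deep-scrub timestamps
ceph pg dump --format=json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[].last_deep_scrub_stamp' \
  | cut -c1-10 | xargs -I{} date +%A -d {} | sort | uniq -c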

· One min read
Joachim Kraftmayer

radosgw-admin key create --uid=clyso-user-id --key-type=s3 --gen-access-key --gen-secret

...

"keys": [
{
"user": "clyso-user-id",
"access_key": "VO8C17LBI9Y39FSODOU5",
"secret_key": "zExCLO1bLQJXoY451ZiKpeoePLSQ1khOJG4CcT3N"
}
],

...
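The generated keys of a user can be displayed again at any time (using the user ID from above):

radosgw-admin user info --uid=clyso-user-id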

access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/object_gateway_guide_for_red_hat_enterprise_linux/administration_cli#create_a_key

· One min read
Joachim Kraftmayer

We want to create an erasure-coded pool with the failure domain room = availability zone.

We came up with the following rules:

EC M=4, K=2

rule ec_4_2_rule {
    id 3
    type erasure
    min_size 5
    max_size 6
    step take eu-de-root
    step choose indep 3 type room
    step choose indep 2 type host
    step chooseleaf indep 1 type osd
    step emit
}

Description:

Take the CRUSH root eu-de-root, then select 3 independent buckets of type room, select 2 buckets of type host in each room, and take one OSD from each of the selected hosts.

EC M=6, K=3

rule ec_6_3_rule {
    id 4
    type erasure
    min_size 8
    max_size 9
    step take eu-de-root
    step choose indep 3 type room
    step choose indep 3 type host
    step chooseleaf indep 1 type osd
    step emit
}
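A sketch of how such a rule can be used: add it to the decompiled CRUSH map, inject the map back into the cluster, and create an erasure-coded pool that references it (the profile name ec_4_2_profile, the pool name and the PG numbers are only placeholders and must match your k/m layout):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... add ec_4_2_rule to crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool create ecpool-rooms 128 128 erasure ec_4_2_profile ec_4_2_rule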