multisite environment - ceph bucket index dynamic resharding
Dynamic resharding is not supported in multisite environments. It has been disabled by default there since Ceph 12.2.2, but we recommend double-checking the setting.
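On a Luminous cluster you can verify the effective value on a running RadosGW via its admin socket; a minimal sketch, assuming a default socket path (adjust it to your RGW instance name):
ceph daemon /var/run/ceph/ceph-client.rgw.$(hostname -s).asok config get rgw_dynamic_resharding
If necessary, pin it explicitly in ceph.conf under your RGW section:
rgw_dynamic_resharding = false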
radosgw-admin key create --uid=clyso-user-id --key-type=s3 --gen-access-key --gen-secret
...
"keys": [
{
"user": "clyso-user-id",
"access_key": "VO8C17LBI9Y39FSODOU5",
"secret_key": "zExCLO1bLQJXoY451ZiKpeoePLSQ1khOJG4CcT3N"
}
],
...
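The generated keys can be used directly with any S3 client; a minimal sketch with the AWS CLI (the endpoint URL is an assumption, replace it with your RadosGW endpoint):
aws configure set aws_access_key_id VO8C17LBI9Y39FSODOU5
aws configure set aws_secret_access_key zExCLO1bLQJXoY451ZiKpeoePLSQ1khOJG4CcT3N
aws --endpoint-url https://rgw.example.com s3 ls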
We aim to create an erasure-coded pool with the failure domain room = availability zone.
We came up with the following rule for this:
rule ec_4_2_rule {
    id 3
    type erasure
    min_size 5
    max_size 6
    step take eu-de-root
    step choose indep 3 type room
    step choose indep 2 type host
    step chooseleaf indep 1 type osd
    step emit
}
Description:
Take the CRUSH root eu-de-root, then select 3 independent buckets of type room, select 2 buckets of type host in each room, and take one OSD from each of the selected hosts.
rule ec_6_3_rule {
    id 4
    type erasure
    min_size 8
    max_size 9
    step take eu-de-root
    step choose indep 3 type room
    step choose indep 3 type host
    step chooseleaf indep 1 type osd
    step emit
}
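To use such a rule, it must be attached to an erasure-coded pool. A minimal sketch for the 4+2 case (profile name, pool name, and PG count are assumptions; the rule itself must already be part of the installed crushmap, e.g. compiled and injected with crushtool and ceph osd setcrushmap):
ceph osd erasure-code-profile set ec_4_2_profile k=4 m=2
ceph osd pool create ec_4_2_pool 1024 1024 erasure ec_4_2_profile ec_4_2_rule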
There are several testing frameworks for OpenStack Swift, but most of them are no longer maintained or have other problems that make it hard to properly test larger Ceph clusters under load.
At this point we decided to use getput by Mark Seger.
Our Ceph test cluster consisted of 108 Ceph nodes with a total of 2592 OSDs; the network was a spine-leaf architecture.
The load was generated by 12 compute nodes, each with 56 HT cores and 100 Gbit/s network connectivity.
gpsuite (the suite runner included with getput) was used to coordinate the runs across these client nodes.
librmb has been renamed to the Dovecot Ceph plugin.
It is a good example of how a DAX company like Telekom had a Ceph plugin developed in order to be able to operate its email service optimally.
List of users:
radosgw-admin metadata list user
List of buckets:
radosgw-admin metadata list bucket
List of bucket instances:
radosgw-admin metadata list bucket.instance
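The keys returned by these listings can then be passed to radosgw-admin metadata get, for example (placeholders):
radosgw-admin metadata get user:<user-uid>
radosgw-admin metadata get bucket:<bucket-name>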
This gives us all the information needed to relink a bucket to a user:
radosgw-admin bucket link --bucket <bucket-name> --bucket-id <default-uuid>.267207.1 --uid=<user-uid>
Example:
radosgw-admin bucket link --bucket test-clyso-test --bucket-id aa81cf7e-38c5-4200-b26b-86e900207813.267207.1 --uid=c19f62adbc7149ad9d19-8acda2dcf3c0
If you compare the buckets before and after the change, the following values are changed:
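To see exactly what changed, you can dump the bucket metadata before and after the relink and diff the two outputs (bucket name taken from the example above):
radosgw-admin metadata get bucket:test-clyso-test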
During maintenance work, we noticed that a "ceph clock skew detected" warning always occurred when the Ceph monitors were restarted.
What makes this interesting is that the Ceph monitor leader assumes its own time is correct and everyone else's time is wrong.
This means that the cluster remains in the "ceph clock skew detected" state, with all its consequences, until the time is synchronized.
All Ceph monitor and OSD nodes are synchronized via NTP, so in our case the state lasts for about 20 seconds.
Possible solutions:
Synchronize the hardware clock before rebooting
hwclock --systohc
When starting the system, do not start the Ceph Monitor until the time has been synchronized.
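A minimal sketch of the second option with systemd, assuming chronyd or systemd-timesyncd together with chrony-wait.service or systemd-time-wait-sync.service, so that time-sync.target is only reached once the clock is actually synchronized:
# /etc/systemd/system/ceph-mon@.service.d/wait-for-time.conf
[Unit]
Wants=time-sync.target
After=time-sync.target
Afterwards run systemctl daemon-reload so the drop-in takes effect.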
However, you should always bear in mind the possible risks and side effects of such workarounds.
osd max object size
Description: The maximum size of a RADOS object in bytes.
Type: 32-bit Unsigned Integer
Default: 128MB
Before the Ceph Luminous release, the default value was 100 GB; it has since been reduced to 128 MB. This means that unpleasant performance problems can be prevented right from the start.
github.com/ceph/ceph/pull/15520
docs.ceph.com/docs/master/releases/luminous/
docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
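To check the value that is actually in effect, you can query a running OSD via its admin socket (a sketch; osd.0 must be running on the node you execute this on):
ceph daemon osd.0 config get osd_max_object_size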
We encountered the first large omap objects in one of our Luminous Ceph clusters in Q3 2018 and worked with a couple of Ceph Core developers on the solution for internal management of RadosGW objects. This included topics such as large omap objects, dynamic resharding, multisite, deleting old object instances in the RadosGW index pool, and many small changes that were included in the Luminous, Mimic, and subsequent versions.
Here is a step-by-step guide on how to identify large omap objects and the affected buckets, and then manually reshard them.
ceph -s
cluster:
id: 52296cfd-d6c6-3129-bf70-db16f0e4423d
health: HEALTH_WARN
1 large omap object
ceph health detail
HEALTH_WARN 1 large omap objects
1 large objects found in pool 'clyso-test-sin-1.rgw.buckets.index'
Search the cluster log for 'Large omap object found' for more details.
Next, search the ceph.log of the Ceph cluster:
2018-09-26 12:10:38.440682 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77104 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
2018-09-26 12:10:35.037753 osd.1262 osd.1262 192.168.130.31:6836/10060 152 : cluster [WRN] Large omap object found. Object: 28:18428495:::.dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376:head Key count: 2928429 Size (bytes): 861141085
2018-09-26 13:00:00.000103 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77505 : cluster [WRN] overall HEALTH_WARN 1 large omap objects
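To confirm the key count reported in the log, the omap keys of the index object can be counted directly (a sketch; pool and object name taken from the health detail and the log entry above):
rados -p clyso-test-sin-1.rgw.buckets.index listomapkeys .dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376 | wc -l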
From the ceph.log we extract the bucket instance, in this case 143112fc-1178-40e1-b209-b859cd2c264c.38511450.376, and look for it in the RadosGW metadata:
root@salt-master1.clyso.test:~ # radosgw-admin metadata list "bucket.instance" | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
"b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
root@salt-master1.clyso.test:~ #
The instance exists, so we check its metadata:
root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
{
"key": "bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"ver": {
"tag": "_Ehz5PYLhHBxpsJ_s39lePnX",
"ver": 7
},
"mtime": "2018-04-24 10:02:32.362129Z",
"data": {
"bucket_info": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"creation_time": "2018-02-20 20:58:51.125791Z",
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"flags": 0,
"zonegroup": "1c44aba5-fe64-4db3-9ef7-f0eb30bf5d80",
"placement_rule": "default-placement",
"has_instance_obj": "true",
"quota": {
"enabled": true,
"check_on_raw": true,
"max_size": 54975581388800,
"max_size_kb": 53687091200,
"max_objects": -1
},
"num_shards": 0,
"bi_shard_hash_type": 0,
"requester_pays": "false",
"has_website": "false",
"swift_versioning": "false",
"swift_ver_location": "",
"index_type": 0,
"mdsearch_config": [],
"reshard_status": 0,
"new_bucket_instance_id": ""
},
"attrs": [
{
"key": "user.rgw.acl",
"val": "AgK4A.....AAAAAAA="
},
{
"key": "user.rgw.idtag",
"val": ""
},
{
"key": "user.rgw.x-amz-read",
"val": "aW52YWxpZAA="
},
{
"key": "user.rgw.x-amz-write",
"val": "aW52YWxpZAA="
}
]
}
}
root@salt-master1.clyso.test:~ #
root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket:b1868d6d-9d61-49b0-b101-c89207009b16
{
"key": "bucket:b1868d6d-9d61-49b0-b101-c89207009b16",
"ver": {
"tag": "_WaSWh9mb21kEjHCisSzhWs8",
"ver": 1
},
"mtime": "2018-02-20 20:58:51.152766Z",
"data": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"creation_time": "2018-02-20 20:58:51.125791Z",
"linked": "true",
"has_bucket_info": "false"
}
}
root@salt-master1.clyso.test:~ #
root@salt-master1.clyso.test:~ # rados -p eu-de-200-1.rgw.buckets.index ls | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376" | wc -l
1
root@salt-master1.clyso.test:~ #
143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
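The affected bucket can then be resharded manually. A sketch (the bucket name is a placeholder and the shard count an assumption; with a rule of thumb of roughly 100,000 index entries per shard, the ~2.9 million keys above call for at least 32 shards):
radosgw-admin bucket reshard --bucket=<bucket-name> --num-shards=32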
size & size_kb: the sum of all object sizes in the bucket/container = output of swift stat <bucket/container> | grep Bytes
size_actual & size_kb_actual: account for compression and encryption, showing the nearest 4k alignment = output of swift stat <bucket/container> | grep X-Container-Bytes-Used-Actual
num_objects: the number of objects = output of swift stat <bucket/container> | grep Objects
size_utilized & size_kb_utilized: the total size of the compressed data in bytes and kilobytes => we don't use compression, so size = size_utilized
The sizes do not include the overhead of the underlying 3x replication or erasure coding.
ceph-rgw4:~/clyso# radosgw-admin bucket stats --bucket=size-container
{
"bucket": "size-container",
"zonegroup": "226fe09d-0ebf-4f30-a93b-d136f24a04d3",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"marker": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"index_type": "Normal",
"owner": "0fdfa377cd56439ab3e3e65c69787e92",
"ver": "0#7",
"master_ver": "0#0",
"mtime": "2018-09-03 12:37:37.744221",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size": 4149,
"size_actual": 16384,
"size_utilized": 4149,
"size_kb": 5,
"size_kb_actual": 16,
"size_kb_utilized": 5,
"num_objects": 3
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": true,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
ceph-rgw4:~/clyso # swift stat size-container
Account: v1
Container: size-container
Objects: 3
Bytes: 4149
Read ACL:
Write ACL:
Sync To:
Sync Key:
Accept-Ranges: bytes
X-Storage-Policy: default-placement
X-Container-Bytes-Used-Actual: 16384
X-Timestamp: 1535967792.05717
X-Trans-Id: tx00000000000000002378a-005b8e218c-b2faf1-eu-de-997-1
Content-Type: text/plain; charset=utf-8
X-Openstack-Request-Id: tx00000000000000002378a-005b8e218c-b2faf1-eu-de-997-1
The size output after uploading one, two, and three objects was as follows:
"size": 26,
"size_actual": 4096,
"size_utilized": 26,
"size_kb": 1,
"size_kb_actual": 4,
"size_kb_utilized": 1,
"num_objects": 1
"size": 52,
"size_actual": 8192,
"size_utilized": 52,
"size_kb": 1,
"size_kb_actual": 8,
"size_kb_utilized": 1,
"num_objects": 2
"size": 4149,
"size_actual": 16384,
"size_utilized": 4149,
"size_kb": 5,
"size_kb_actual": 16,
"size_kb_utilized": 5,
"num_objects": 3