multisite environment - ceph bucket index dynamic resharding
Dynamic resharding is not supported in multisite environments. It has been disabled by default there since Ceph 12.2.2, but we recommend double-checking the setting.
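On a Luminous cluster you can verify the effective value on a running RadosGW via its admin socket; a minimal sketch, assuming a default socket path (adjust it to your RGW instance name):
ceph daemon /var/run/ceph/ceph-client.rgw.$(hostname -s).asok config get rgw_dynamic_resharding
If necessary, pin it explicitly in ceph.conf under your RGW section:
rgw_dynamic_resharding = false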
radosgw-admin key create --uid=clyso-user-id --key-type=s3 --gen-access-key --gen-secret
...
"keys": [
{
"user": "clyso-user-id",
"access_key": "VO8C17LBI9Y39FSODOU5",
"secret_key": "zExCLO1bLQJXoY451ZiKpeoePLSQ1khOJG4CcT3N"
}
],
...
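The generated keys can be used directly with any S3 client; a minimal sketch with the AWS CLI (the endpoint URL is an assumption, replace it with your RadosGW endpoint):
aws configure set aws_access_key_id VO8C17LBI9Y39FSODOU5
aws configure set aws_secret_access_key zExCLO1bLQJXoY451ZiKpeoePLSQ1khOJG4CcT3N
aws --endpoint-url https://rgw.example.com s3 ls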
We aim to create an erasure-coded pool with the failure domain room = availability zone.
We came up with the following rule for this:
rule ec_4_2_rule {
    id 3
    type erasure
    min_size 5
    max_size 6
    step take eu-de-root
    step choose indep 3 type room
    step choose indep 2 type host
    step chooseleaf indep 1 type osd
    step emit
}
Description:
Take the CRUSH root eu-de-root, then select 3 independent buckets of type room, select 2 buckets of type host in each room, and take one OSD from each of the selected hosts.
rule ec_6_3_rule {
    id 4
    type erasure
    min_size 8
    max_size 9
    step take eu-de-root
    step choose indep 3 type room
    step choose indep 3 type host
    step chooseleaf indep 1 type osd
    step emit
}
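To use such a rule, it must be attached to an erasure-coded pool. A minimal sketch for the 4+2 case (profile name, pool name, and PG count are assumptions; the rule itself must already be part of the installed crushmap, e.g. compiled and injected with crushtool and ceph osd setcrushmap):
ceph osd erasure-code-profile set ec_4_2_profile k=4 m=2
ceph osd pool create ec_4_2_pool 1024 1024 erasure ec_4_2_profile ec_4_2_rule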
There are several testing frameworks for OpenStack Swift, but most of them are no longer maintained or have other problems that make it hard to properly test larger Ceph clusters under load.
At this point we decided to use getput by Mark Seger.
Our Ceph test cluster consisted of 108 Ceph nodes with a total of 2592 OSDs; the network was a spine-leaf architecture.
The load was generated by 12 compute nodes, each with 56 HT cores and 100 Gbit/s network connectivity.
gpsuite (the suite runner included with getput) was used to coordinate the runs across these client nodes.
librmb has been renamed to the Dovecot Ceph plugin.
It is a good example of how a DAX company like Telekom had a Ceph plugin developed in order to be able to operate its email service optimally.
List of users:
radosgw-admin metadata list user
List of buckets:
radosgw-admin metadata list bucket
List of bucket instances:
radosgw-admin metadata list bucket.instance
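The keys returned by these listings can then be passed to radosgw-admin metadata get, for example (placeholders):
radosgw-admin metadata get user:<user-uid>
radosgw-admin metadata get bucket:<bucket-name>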
This gives us all the information needed to relink a bucket to a user:
radosgw-admin bucket link --bucket <bucket-name> --bucket-id <default-uuid>.267207.1 --uid=<user-uid>
Example:
radosgw-admin bucket link --bucket test-clyso-test --bucket-id aa81cf7e-38c5-4200-b26b-86e900207813.267207.1 --uid=c19f62adbc7149ad9d19-8acda2dcf3c0
If you compare the buckets before and after the change, the following values are changed:
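To see exactly what changed, you can dump the bucket metadata before and after the relink and diff the two outputs (bucket name taken from the example above):
radosgw-admin metadata get bucket:test-clyso-test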
During maintenance work, we noticed that a "ceph clock skew detected" warning always occurred when the Ceph monitors were restarted.
What makes this interesting is that the Ceph monitor leader assumes its own time is correct and everyone else's time is wrong.
This means that the cluster remains in the "ceph clock skew detected" state, with all its consequences, until the time is synchronized.
All Ceph monitor and OSD nodes are synchronized via NTP, so in our case the state lasts for about 20 seconds.
Possible solutions:
Synchronize the hardware clock before rebooting
hwclock --systohc
When starting the system, do not start the Ceph Monitor until the time has been synchronized.
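A minimal sketch of the second option with systemd, assuming chronyd or systemd-timesyncd together with chrony-wait.service or systemd-time-wait-sync.service, so that time-sync.target is only reached once the clock is actually synchronized:
# /etc/systemd/system/ceph-mon@.service.d/wait-for-time.conf
[Unit]
Wants=time-sync.target
After=time-sync.target
Afterwards run systemctl daemon-reload so the drop-in takes effect.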
However, you should always bear in mind the possible risks and side effects of such workarounds.
osd max object size
Description: The maximum size of a RADOS object in bytes.
Type: 32-bit Unsigned Integer
Default: 128MB
Before the Ceph Luminous release, the default value was 100 GB; it has since been reduced to 128 MB. This means that unpleasant performance problems can be prevented right from the start.
github.com/ceph/ceph/pull/15520
docs.ceph.com/docs/master/releases/luminous/
docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
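To check the value that is actually in effect, you can query a running OSD via its admin socket (a sketch; osd.0 must be running on the node you execute this on):
ceph daemon osd.0 config get osd_max_object_size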
We encountered the first large omap objects in one of our Luminous Ceph clusters in Q3 2018 and worked with a couple of Ceph Core developers on the solution for internal management of RadosGW objects. This included topics such as large omap objects, dynamic resharding, multisite, deleting old object instances in the RadosGW index pool, and many small changes that were included in the Luminous, Mimic, and subsequent versions.
Here is a step-by-step guide on how to identify large omap objects and the affected buckets, and then manually reshard them.
ceph -s
cluster:
id: 52296cfd-d6c6-3129-bf70-db16f0e4423d
health: HEALTH_WARN
1 large omap object
ceph health detail
HEALTH_WARN 1 large omap objects
1 large objects found in pool 'clyso-test-sin-1.rgw.buckets.index'
Search the cluster log for 'Large omap object found' for more details.
Next, search the ceph.log of the Ceph cluster:
2018-09-26 12:10:38.440682 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77104 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
2018-09-26 12:10:35.037753 osd.1262 osd.1262 192.168.130.31:6836/10060 152 : cluster [WRN] Large omap object found. Object: 28:18428495:::.dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376:head Key count: 2928429 Size (bytes): 861141085
2018-09-26 13:00:00.000103 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77505 : cluster [WRN] overall HEALTH_WARN 1 large omap objects
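To confirm the key count reported in the log, the omap keys of the index object can be counted directly (a sketch; pool and object name taken from the health detail and the log entry above):
rados -p clyso-test-sin-1.rgw.buckets.index listomapkeys .dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376 | wc -l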
From the ceph.log we extract the bucket instance, in this case 143112fc-1178-40e1-b209-b859cd2c264c.38511450.376, and look for it in the RadosGW metadata:
root@salt-master1.clyso.test:~ # radosgw-admin metadata list "bucket.instance" | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
"b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
root@salt-master1.clyso.test:~ #
The instance exists, so we check its metadata:
root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
{
"key": "bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"ver": {
"tag": "_Ehz5PYLhHBxpsJ_s39lePnX",
"ver": 7
},
"mtime": "2018-04-24 10:02:32.362129Z",
"data": {
"bucket_info": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"creation_time": "2018-02-20 20:58:51.125791Z",
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"flags": 0,
"zonegroup": "1c44aba5-fe64-4db3-9ef7-f0eb30bf5d80",
"placement_rule": "default-placement",
"has_instance_obj": "true",
"quota": {
"enabled": true,
"check_on_raw": true,
"max_size": 54975581388800,
"max_size_kb": 53687091200,
"max_objects": -1
},
"num_shards": 0,
"bi_shard_hash_type": 0,
"requester_pays": "false",
"has_website": "false",
"swift_versioning": "false",
"swift_ver_location": "",
"index_type": 0,
"mdsearch_config": [],
"reshard_status": 0,
"new_bucket_instance_id": ""
},
"attrs": [
{
"key": "user.rgw.acl",
"val": "AgK4A.....AAAAAAA="
},
{
"key": "user.rgw.idtag",
"val": ""
},
{
"key": "user.rgw.x-amz-read",
"val": "aW52YWxpZAA="
},
{
"key": "user.rgw.x-amz-write",
"val": "aW52YWxpZAA="
}
]
}
}
root@salt-master1.clyso.test:~ #
root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket:b1868d6d-9d61-49b0-b101-c89207009b16
{
"key": "bucket:b1868d6d-9d61-49b0-b101-c89207009b16",
"ver": {
"tag": "_WaSWh9mb21kEjHCisSzhWs8",
"ver": 1
},
"mtime": "2018-02-20 20:58:51.152766Z",
"data": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"creation_time": "2018-02-20 20:58:51.125791Z",
"linked": "true",
"has_bucket_info": "false"
}
}
root@salt-master1.clyso.test:~ #
root@salt-master1.clyso.test:~ # rados -p eu-de-200-1.rgw.buckets.index ls | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376" | wc -l
1
root@salt-master1.clyso.test:~ #
143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
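The affected bucket can then be resharded manually. A sketch (the bucket name is a placeholder and the shard count an assumption; with a rule of thumb of roughly 100,000 index entries per shard, the ~2.9 million keys above call for at least 32 shards):
radosgw-admin bucket reshard --bucket=<bucket-name> --num-shards=32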
size & size_kb: the sum of all object sizes in the bucket/container = output of swift stat <bucket/container> | grep Bytes
size_actual & size_kb_actual: account for compression and encryption, showing the nearest 4k alignment = output of swift stat <bucket/container> | grep X-Container-Bytes-Used-Actual
num_objects: the number of objects = output of swift stat <bucket/container> | grep Objects
size_utilized & size_kb_utilized: the total size of the compressed data in bytes and kilobytes => we don't use compression, so size = size_utilized
The sizes do not include the overhead of the underlying 3x replication or erasure coding.
ceph-rgw4:~/clyso# radosgw-admin bucket stats --bucket=size-container
{
"bucket": "size-container",
"zonegroup": "226fe09d-0ebf-4f30-a93b-d136f24a04d3",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"marker": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"index_type": "Normal",
"owner": "0fdfa377cd56439ab3e3e65c69787e92",
"ver": "0#7",
"master_ver": "0#0",
"mtime": "2018-09-03 12:37:37.744221",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size": 4149,
"size_actual": 16384,
"size_utilized": 4149,
"size_kb": 5,
"size_kb_actual": 16,
"size_kb_utilized": 5,
"num_objects": 3
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": true,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
ceph-rgw4:~/clyso # swift stat size-container
Account: v1
Container: size-container
Objects: 3
Bytes: 4149
Read ACL:
Write ACL:
Sync To:
Sync Key:
Accept-Ranges: bytes
X-Storage-Policy: default-placement
X-Container-Bytes-Used-Actual: 16384
X-Timestamp: 1535967792.05717
X-Trans-Id: tx00000000000000002378a-005b8e218c-b2faf1-eu-de-997-1
Content-Type: text/plain; charset=utf-8
X-Openstack-Request-Id: tx00000000000000002378a-005b8e218c-b2faf1-eu-de-997-1
The size output after uploading one, two, and three objects was as follows:
"size": 26,
"size_actual": 4096,
"size_utilized": 26,
"size_kb": 1,
"size_kb_actual": 4,
"size_kb_utilized": 1,
"num_objects": 1
"size": 52,
"size_actual": 8192,
"size_utilized": 52,
"size_kb": 1,
"size_kb_actual": 8,
"size_kb_utilized": 1,
"num_objects": 2
"size": 4149,
"size_actual": 16384,
"size_utilized": 4149,
"size_kb": 5,
"size_kb_actual": 16,
"size_kb_utilized": 5,
"num_objects": 3