101 posts tagged with "ceph"

· One min read
Joachim Kraftmayer

We want to create an erasure-coded pool that uses the room as its failure domain, where each room corresponds to one availability zone.

We have come up with a rule for this:

EC K=4, M=2 (4 data chunks, 2 coding chunks)

rule ec_4_2_rule {
id 3
type erasure
min_size 5
max_size 6
step take eu-de-root
step choose indep 3 type room
step choose indep 2 type host
step chooseleaf indep 1 type osd
step emit
}

Description:

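Take the CRUSH root eu-de-root, select 3 independent buckets of type room, select 2 buckets of type host in each room, and take one OSD from each of the selected hosts. This yields 3 × 2 = 6 OSDs, one for each of the 6 chunks, so the loss of a whole room costs at most 2 chunks.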

EC K=6, M=3 (6 data chunks, 3 coding chunks)

rule ec_6_3_rule {
id 4
type erasure
min_size 8
max_size 9
step take eu-de-root
step choose indep 3 type room
step choose indep 3 type host
step chooseleaf indep 1 type osd
step emit
}
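
The arithmetic behind both rules can be checked with a small sketch. It assumes that k denotes the data chunks and m the coding chunks (the usual Ceph convention) and that each selected host holds exactly one chunk, as in the rules above. The function name is made up for this illustration and is not part of any Ceph tooling.

```python
def survives_room_loss(k: int, m: int, rooms: int, hosts_per_room: int) -> bool:
    """Return True if the data is still recoverable after losing one whole room.

    Assumption: the CRUSH rule places exactly one chunk on each selected host,
    so every room holds `hosts_per_room` chunks.
    """
    total_chunks = rooms * hosts_per_room
    if total_chunks != k + m:
        raise ValueError(
            f"rule selects {total_chunks} OSDs but the profile needs {k + m}"
        )
    # Losing one room removes hosts_per_room chunks; at most m may be lost.
    return hosts_per_room <= m


# 4+2 spread over 3 rooms with 2 hosts each
print(survives_room_loss(k=4, m=2, rooms=3, hosts_per_room=2))
# 6+3 spread over 3 rooms with 3 hosts each
print(survives_room_loss(k=6, m=3, rooms=3, hosts_per_room=3))
```

Recoverable here only means that at least k chunks remain; whether the pool keeps accepting I/O also depends on the pool's own min_size setting.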

· One min read
Joachim Kraftmayer

There are several testing frameworks for OpenStack Swift, but most of them are no longer maintained or have other problems that prevent them from properly testing larger Ceph clusters under load.

At this point we decided to use getput by Mark Seger.

Test Setup

Ceph Cluster

Our Ceph test cluster consisted of 108 Ceph nodes with a total of 2592 OSDs. The network was a spine-leaf architecture.

OpenStack Compute Nodes

12 compute nodes with 56 HT cores and 100 GBit/s network connectivity.

Test setup

gpsuite

Sources:

github.com/markseger/getput

· One min read
Joachim Kraftmayer

List of users:

radosgw-admin metadata list user

List of buckets:

radosgw-admin metadata list bucket

List of bucket instances:

radosgw-admin metadata list bucket.instance

All necessary information:

  • user-id = output from the list of users
  • bucket-id = output from the list of bucket instances
  • bucket-name = output from the list of buckets or bucket instances

Change of user for this bucket instance:

radosgw-admin bucket link --bucket <bucket-name> --bucket-id <default-uuid>.267207.1 --uid=<user-uid>

Example:

radosgw-admin bucket link --bucket test-clyso-test --bucket-id aa81cf7e-38c5-4200-b26b-86e900207813.267207.1 --uid=c19f62adbc7149ad9d19-8acda2dcf3c0

If you compare the buckets before and after the change, the following values are changed:

  • ver: is increased
  • mtime: will be updated
  • owner: is set to the new uid
  • user.rgw.acl: the permissions stored under this attribute key are reset
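
One way to make such a before/after comparison is to save the JSON output of radosgw-admin metadata get before and after the change and diff the two documents. The following is a minimal sketch under that assumption; the helper name and the sample values are hypothetical and only illustrate the idea.

```python
def changed_fields(before: dict, after: dict, prefix: str = "") -> list[str]:
    """Return the dotted paths of all values that differ between two JSON documents."""
    changed = []
    for key in sorted(set(before) | set(after)):
        path = f"{prefix}.{key}" if prefix else key
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Descend into nested objects so only the leaf that changed is reported.
            changed.extend(changed_fields(old, new, path))
        elif old != new:
            changed.append(path)
    return changed


# Hypothetical, heavily shortened metadata documents (not real cluster output)
before = {"ver": {"ver": 1}, "mtime": "t0", "data": {"owner": "old-uid", "bucket": {"name": "b"}}}
after = {"ver": {"ver": 2}, "mtime": "t1", "data": {"owner": "new-uid", "bucket": {"name": "b"}}}
print(changed_fields(before, after))
```

With real output, the two dictionaries would come from json.load() on the saved files.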

· One min read
Joachim Kraftmayer

During maintenance work, we noticed that a "ceph clock skew detected" always occurred when restarting the Ceph monitors.

What makes this interesting is that the Ceph monitor leader assumes that its own time is correct and that everyone else's time is wrong.

This means that the cluster remains in the "ceph clock skew detected" state with all its consequences until the time is synchronized.

All Ceph monitors and OSD nodes are synchronized via NTP, so the state lasts only about 20 seconds.

Possible solutions:

Synchronize the hardware clock before rebooting

hwclock --systohc

When starting the system, do not start the Ceph Monitor until the time has been synchronized.

However, you should always bear in mind the risks and side effects of either approach on a day when things go wrong.
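
The second option, delaying the monitor start until the clock is synchronized, could be sketched as a small wait loop that runs before the monitor service is started. This is only an illustration: it assumes a systemd-based host where timedatectl show --property=NTPSynchronized --value is available, and the function names, timeout, and interval are made up for this example.

```python
import subprocess
import time
from typing import Callable


def ntp_synchronized() -> bool:
    """Ask systemd whether the system clock is NTP-synchronized (assumes timedatectl)."""
    try:
        result = subprocess.run(
            ["timedatectl", "show", "--property=NTPSynchronized", "--value"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # timedatectl is not installed; treat the clock as not synchronized.
        return False
    return result.stdout.strip() == "yes"


def wait_for_sync(
    probe: Callable[[], bool] = ntp_synchronized,
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it reports a synchronized clock or `timeout` seconds have been waited."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

A wrapper script could call wait_for_sync() and start the monitor only when it returns True. What should happen on a timeout is a deliberate decision, since a monitor that never starts is a risk of its own.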

· One min read
Joachim Kraftmayer
osd max object size

Description: The maximum size of a RADOS object in bytes.
Type: 32-bit Unsigned Integer
Default: 128MB

Before the Ceph Luminous release, the default value was 100 GB. With Luminous it was reduced to 128 MB, which prevents unpleasant performance problems caused by oversized objects right from the start.

github.com/ceph/ceph/pull/15520

docs.ceph.com/docs/master/releases/luminous/

docs.ceph.com/docs/master/rados/configuration/osd-config-ref/

· 3 min read
Joachim Kraftmayer

We encountered the first large omap objects in one of our Luminous Ceph clusters in Q3 2018 and worked with a couple of Ceph Core developers on the solution for internal management of RadosGW objects. This included topics such as large omap objects, dynamic resharding, multisite, deleting old object instances in the RadosGW index pool, and many small changes that were included in the Luminous, Mimic, and subsequent versions.

Here is a step-by-step guide on how to identify large omap objects and the affected buckets and then manually reshard them.

Output of ceph status:

ceph -s

cluster:
id: 52296cfd-d6c6-3129-bf70-db16f0e4423d
health: HEALTH_WARN
1 large omap object

Output of ceph health detail:

ceph health detail
HEALTH_WARN 1 large omap objects
1 large objects found in pool 'clyso-test-sin-1.rgw.buckets.index'
Search the cluster log for 'Large omap object found' for more details.
Search the ceph.log of the Ceph cluster:
2018-09-26 12:10:38.440682 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77104 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
2018-09-26 12:10:35.037753 osd.1262 osd.1262 192.168.130.31:6836/10060 152 : cluster [WRN] Large omap object found. Object: 28:18428495:::.dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376:head Key count: 2928429 Size (bytes): 861141085
2018-09-26 13:00:00.000103 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77505 : cluster [WRN] overall HEALTH_WARN 1 large omap objects

From the ceph.log we extract the bucket instance, in this case:

143112fc-1178-40e1-b209-b859cd2c264c.38511450.376 and look for it in the RadosGW metadata

root@salt-master1.clyso.test:~ # radosgw-admin metadata list "bucket.instance" | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
"b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
root@salt-master1.clyso.test:~ #

The instance exists, so we check its metadata.

root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
{
"key": "bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"ver": {
"tag": "_Ehz5PYLhHBxpsJ_s39lePnX",
"ver": 7
},
"mtime": "2018-04-24 10:02:32.362129Z",
"data": {
"bucket_info": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"creation_time": "2018-02-20 20:58:51.125791Z",
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"flags": 0,
"zonegroup": "1c44aba5-fe64-4db3-9ef7-f0eb30bf5d80",
"placement_rule": "default-placement",
"has_instance_obj": "true",
"quota": {
"enabled": true,
"check_on_raw": true,
"max_size": 54975581388800,
"max_size_kb": 53687091200,
"max_objects": -1
},
"num_shards": 0,
"bi_shard_hash_type": 0,
"requester_pays": "false",
"has_website": "false",
"swift_versioning": "false",
"swift_ver_location": "",
"index_type": 0,
"mdsearch_config": [],
"reshard_status": 0,
"new_bucket_instance_id": ""
},
"attrs": [
{
"key": "user.rgw.acl",
"val": "AgK4A.....AAAAAAA="
},
{
"key": "user.rgw.idtag",
"val": ""
},
{
"key": "user.rgw.x-amz-read",
"val": "aW52YWxpZAA="
},
{
"key": "user.rgw.x-amz-write",
"val": "aW52YWxpZAA="
}
]
}
}
root@salt-master1.clyso.test:~ #

Get the metadata of the bucket:

root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket:b1868d6d-9d61-49b0-b101-c89207009b16
{
"key": "bucket:b1868d6d-9d61-49b0-b101-c89207009b16",
"ver": {
"tag": "_WaSWh9mb21kEjHCisSzhWs8",
"ver": 1
},
"mtime": "2018-02-20 20:58:51.152766Z",
"data": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"creation_time": "2018-02-20 20:58:51.125791Z",
"linked": "true",
"has_bucket_info": "false"
}
}
root@salt-master1.clyso.test:~ #

Grep for the bucket_id in the RadosGW index pool:

root@salt-master1.clyso.test:~ # rados -p eu-de-200-1.rgw.buckets.index ls | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376" | wc -l
1
root@salt-master1.clyso.test:~ #

The bucket index RADOS object that has to be resharded:

143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
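
Before resharding, a target shard count is needed. A rough estimate can be derived from the key count reported in the log. The sketch below assumes a target of 100,000 keys per shard, which to our knowledge matches the default of rgw_max_objs_per_shard; treat that number as an assumption and check it against your release. The function name is hypothetical.

```python
import math


def suggested_shards(key_count: int, keys_per_shard: int = 100_000) -> int:
    """Estimate how many index shards keep each shard at or below keys_per_shard keys."""
    if keys_per_shard <= 0:
        raise ValueError("keys_per_shard must be positive")
    return max(1, math.ceil(key_count / keys_per_shard))


# Key count taken from the "Large omap object found" log line
print(suggested_shards(2928429))
```

The result can then be passed as --num-shards to radosgw-admin bucket reshard for the affected bucket. In practice it is common to round up generously so the bucket has room to grow.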

· 2 min read
Joachim Kraftmayer

size & size_kb: sum of all object sizes in the bucket/container = output of swift stat <bucket/container> | grep Bytes

size_actual & size_kb_actual: account for compression and encryption (each object rounded up to the next 4k boundary) = output of swift stat <bucket/container> | grep X-Container-Bytes-Used-Actual

num_objects: number of objects = output of swift stat <bucket/container> | grep Objects

size_utilized & size_kb_utilized: total size of the compressed data in bytes and kilobytes => we don't use compression, so size = size_utilized

These sizes do not include the overhead of the underlying replication (factor 3) or erasure coding.

ceph-rgw4:~/clyso# radosgw-admin bucket stats --bucket=size-container
{
"bucket": "size-container",
"zonegroup": "226fe09d-0ebf-4f30-a93b-d136f24a04d3",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"marker": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"index_type": "Normal",
"owner": "0fdfa377cd56439ab3e3e65c69787e92",
"ver": "0#7",
"master_ver": "0#0",
"mtime": "2018-09-03 12:37:37.744221",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size": 4149,
"size_actual": 16384,
"size_utilized": 4149,
"size_kb": 5,
"size_kb_actual": 16,
"size_kb_utilized": 5,
"num_objects": 3
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": true,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
ceph-rgw4:~/clyso # swift stat size-container
Account: v1
Container: size-container
Objects: 3
Bytes: 4149
Read ACL:
Write ACL:
Sync To:
Sync Key:
Accept-Ranges: bytes
X-Storage-Policy: default-placement
X-Container-Bytes-Used-Actual: 16384
X-Timestamp: 1535967792.05717
X-Trans-Id: tx00000000000000002378a-005b8e218c-b2faf1-eu-de-997-1
Content-Type: text/plain; charset=utf-8
X-Openstack-Request-Id: tx00000000000000002378a-005b8e218c-b2faf1-eu-de-997-1

We first uploaded a 26-byte object, then another 26-byte object, and then a 4097-byte object.

The output of the sizes was as follows:

1 object

"size": 26,
"size_actual": 4096,
"size_utilized": 26,
"size_kb": 1,
"size_kb_actual": 4,
"size_kb_utilized": 1,
"num_objects": 1

2 objects

"size": 52,
"size_actual": 8192,
"size_utilized": 52,
"size_kb": 1,
"size_kb_actual": 8,
"size_kb_utilized": 1,
"num_objects": 2

3 objects

"size": 4149,
"size_actual": 16384,
"size_utilized": 4149,
"size_kb": 5,
"size_kb_actual": 16,
"size_kb_utilized": 5,
"num_objects": 3
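
The three outputs can be reproduced with a short calculation. The sketch assumes that size_actual rounds every object up to the next 4096-byte boundary and that the *_kb values are rounded up to whole kilobytes, which is what the numbers above suggest. It is a model of the observed output, not the RadosGW implementation.

```python
import math

BLOCK = 4096  # assumed alignment for size_actual


def bucket_usage(object_sizes: list[int]) -> dict:
    """Model the usage counters of a bucket from the sizes of its objects."""
    size = sum(object_sizes)
    # Each object is rounded up to a full 4k block on its own.
    size_actual = sum(math.ceil(s / BLOCK) * BLOCK for s in object_sizes)
    return {
        "size": size,
        "size_actual": size_actual,
        "size_kb": math.ceil(size / 1024),
        "size_kb_actual": math.ceil(size_actual / 1024),
        "num_objects": len(object_sizes),
    }


for sizes in ([26], [26, 26], [26, 26, 4097]):
    print(bucket_usage(sizes))
```

The third object is the interesting one: at 4097 bytes it is one byte over a block, so it occupies two blocks (8192 bytes) in size_actual.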

· One min read
Joachim Kraftmayer

The aim is to scale the RGW instances of the production system so that 10,000 active connections are possible.

After various test runs, we arrived at the following configuration for our setup:

[client.rgw.<id>]
keyring = /etc/ceph/ceph.client.rgw.keyring
rgw content length compat = true
rgw dns name = <rgw.hostname.clyso.com>
rgw enable ops log = false
rgw enable usage log = false
rgw frontends = civetweb port=80 error_log_file=/var/log/radosgw/civetweb.error.log
rgw num rados handles = 8
rgw swift url = http://<rgw.hostname.clyso.com>
rgw thread pool size = 512

Notes on the configuration

rgw thread pool size is the default value for num_threads of the civetweb web server.

Line 54: https://github.com/ceph/ceph/blob/master/src/rgw/rgw_civetweb_frontend.cc

set_conf_default(conf_map, "num_threads",
                 std::to_string(g_conf->rgw_thread_pool_size));

[client.radosgw]
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw content length compat = true
rgw dns name = <fqdn hostname>
rgw enable ops log = false
rgw enable usage log = false
rgw frontends = civetweb port=8080 num_threads=512 error_log_file=/var/log/radosgw/civetweb.error.log
rgw num rados handles = 8
rgw swift url = http://<fqdn hostname>
rgw thread pool size = 512
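
For a first sizing estimate, the connection target can be divided by the number of worker threads per instance. The sketch below assumes that every active connection occupies one civetweb worker thread for its lifetime and that the load is spread evenly across instances. Both are simplifications, and the function name is made up for this example.

```python
import math


def instances_needed(connections: int, threads_per_instance: int) -> int:
    """Lower bound on RGW instances if one thread serves one active connection."""
    if threads_per_instance <= 0:
        raise ValueError("threads_per_instance must be positive")
    return math.ceil(connections / threads_per_instance)


# 10,000 active connections with 512 threads per instance
print(instances_needed(10_000, 512))
```

This is a lower bound only. Real deployments need headroom for uneven load balancing and for instances that are down for maintenance.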

Sources:

https://github.com/ceph/ceph/blob/master/doc/radosgw/config-ref.rst

http://docs.ceph.com/docs/master/radosgw/config-ref/

https://github.com/ceph/ceph/blob/master/src/rgw/rgw_civetweb_frontend.cc

https://indico.cern.ch/event/578974/contributions/2695212/attachments/1521538/2377177/Ceph_pre-gdb_2017.pdf

http://www.osris.org/performance/rgw.html

https://www.swiftstack.com/docs/integration/python-swiftclient.html

https://github.com/civetweb/civetweb/tree/master/docs