
100 posts tagged with "ceph"


· 2 min read
Joachim Kraftmayer

Ceph Version: Luminous 12.2.2 with Filestore and XFS

After more than 2 years of operation, several OSDs on the production Ceph cluster reported the error message:

** ERROR: osd init failed: (28) No space left on device

and terminated themselves. Attempts to restart the OSDs always ended with the same error message.

The Ceph cluster changed from HEALTH_OK to HEALTH_ERR status with the warning:

ceph osd near full

ceph pool near full

A superficial check with df -h showed only 71% to 89% used disk space on the affected OSDs, yet no more files could be created on the file system.

Neither a remount nor an unmount and mount changed the situation.

The first suspicion was that the inode64 mount option for XFS might be missing, but this option was set. After a closer examination of the internal statistics of the XFS file system with

xfs_db -r "-c freesp -s" /dev/sdd1   # free space histogram and summary for the device

df -h   # used and available space per file system

df -i   # used and available inodes per file system

we chose the following solution:

First we stopped further rebalancing with

ceph osd set noout

so as not to fill the remaining OSDs any further. We then let Ceph redistribute the data across the remaining OSDs according to their utilization with

ceph osd reweight-by-utilization
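Depending on the Ceph release, a dry run may be available to preview the proposed weight changes before applying them:

ceph osd test-reweight-by-utilization   # show proposed reweights without applying them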

We then moved a single PG (important: always a different PG per OSD) from the affected OSD to /root to gain additional space on the file system, and started the OSDs.
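A hedged sketch of that step, assuming Filestore's default data path; the OSD id and PG id are placeholders, and the chosen PG should be healthy on the other OSDs:

systemctl stop ceph-osd@12
mkdir -p /root/pg-rescue
# move one PG directory off the full device (always a different PG per OSD)
mv /var/lib/ceph/osd/ceph-12/current/3.1f_head /root/pg-rescue/
systemctl start ceph-osd@12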

In the next step, we deleted virtual machine images that were no longer required from our cloud environment.

It took some time for the blocked requests to clear and the system to resume normal operation.

Unfortunately, it was not possible for us to definitively clarify the cause.

However, as we are currently in the process of switching from Filestore to Bluestore, we will soon no longer need XFS.

· One min read
Joachim Kraftmayer

In case you quickly need the syntax for the radosgw-admin object stat command:

clyso-ceph-rgw-client:~/clyso # radosgw-admin object stat --bucket=size-container --object=clysofile


{
    "name": "clysofile",
    "size": 26,
    "policy": {
        "acl": {
            "acl_user_map": [
                {
                    "user": "clyso-user",
                    "acl": 15
                }
            ],
            "acl_group_map": [],
            "grant_map": [
                {
                    "id": "clyso-user",
                    "grant": {
                        "type": {
                            "type": 0
                        },
                        "id": "clyso-user",
                        "email": "",
                        "permission": {
                            "flags": 15
                        },
                        "name": "clyso-admin",
                        "group": 0,
                        "url_spec": ""
                    }
                }
            ]
        },
        "owner": {
            "id": "clyso-user",
            "display_name": "clyso-admin"
        }
    },
    "etag": "clyso-user",
    "tag": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11729649.143382",
    "manifest": {
        "objs": [],
        "obj_size": 26,
        "explicit_objs": "false",
        "head_size": 26,
        "max_head_size": 4194304,
        "prefix": ".ZQzVc6phBAMCv3lSbiHBo0fftkpXmjm_",
        "rules": [
            {
                "key": 0,
                "val": {
                    "start_part_num": 0,
                    "start_ofs": 4194304,
                    "part_size": 0,
                    "stripe_max_size": 4194304,
                    "override_prefix": ""
                }
            }
        ],
        "tail_instance": "",
        "tail_placement": {
            "bucket": {
                "name": "size-container",
                "marker": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
                "bucket_id": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
                "tenant": "",
                "explicit_placement": {
                    "data_pool": "",
                    "data_extra_pool": "",
                    "index_pool": ""
                }
            },
            "placement_rule": "default-placement"
        }
    },
    "attrs": {
        "user.rgw.pg_ver": "��",
        "user.rgw.source_zone": "eR[�\u0011",
        "user.rgw.tail_tag": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11729649.143382",
        "user.rgw.x-amz-meta-mtime": "1535100720.157102"
    }
}

· One min read
Joachim Kraftmayer

incomplete state

The Ceph cluster has recognized that a placement group (PG) is missing important information. This can mean that information about write operations that may have occurred is missing, or that no error-free copies exist.

The recommendation is to bring all OSDs that are in the down or out state back into the Ceph cluster, as these could contain the required information. In the case of an Erasure Coding (EC) pool, temporarily reducing min_size can enable recovery; however, min_size cannot be smaller than the number of data chunks (k) defined for this pool.
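A hedged sketch of such a temporary reduction, with a placeholder pool name and an assumed EC profile of k=4, m=2 (so min_size must not drop below 4):

ceph osd pool get ecpool min_size      # note the current value, typically k+1
ceph osd pool set ecpool min_size 4    # allow recovery with only k shards available
# once the PGs have recovered, restore the original value
ceph osd pool set ecpool min_size 5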

Sources

https://docs.ceph.com/docs/master/rados/operations/pg-states/

https://docs.ceph.com/docs/master/rados/operations/erasure-code/

· One min read
Joachim Kraftmayer

In a production cluster, the removal of OSDs or entire hosts can affect regular operations for users, depending on the load. It is therefore recommended to remove an OSD or a host gradually from production in order to maintain full replication throughout the entire process.

You could execute the commands manually step by step and always wait until the data has been completely redistributed in the cluster.

ceph osd crush reweight osd.<ID> 10.0
ceph osd crush reweight osd.<ID> 8.0
ceph osd crush reweight osd.<ID> 6.0
ceph osd crush reweight osd.<ID> 4.0
ceph osd crush reweight osd.<ID> 2.0
ceph osd crush reweight osd.<ID> 0.0

We wrote our own script for automation years ago, so it should also work with earlier versions, such as Hammer, Jewel, Kraken and Luminous.

ceph osd out <ID>
SysVinit: service ceph stop osd.<ID>
systemd: systemctl stop ceph-osd@<ID>
ceph osd crush remove osd.<ID>
ceph auth del osd.<ID>
ceph osd rm <ID>
Caution: as soon as elements are deleted from the CRUSH map, the Ceph cluster starts rebalancing the data distribution again.
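As a minimal sketch of such an automated, gradual drain (OSD id, weight steps and polling interval are placeholders; it simply waits for HEALTH_OK between steps before you run the removal commands above):

OSD_ID=12
for WEIGHT in 8.0 6.0 4.0 2.0 0.0; do
    ceph osd crush reweight "osd.${OSD_ID}" "${WEIGHT}"
    # wait until backfill has finished and the cluster is healthy again
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
done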

· 2 min read
Joachim Kraftmayer

find user in user list

root@master.qa.cloud.clyso.com:~ # radosgw-admin user list
[
    ...
    "57574cda626b45fba1cd96e68a57ced2",
    ...
    "admin",
    ...
]

get info for a specific user

radosgw-admin user info --uid=57574cda626b45fba1cd96e68a57ced2
{
    "user_id": "57574cda626b45fba1cd96e68a57ced2",
    "display_name": "qa-clyso-backup",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "auid": 0,
    "subusers": [],
    "keys": [],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "keystone"
}

set the quota for one specific user

root@master.qa.cloud.clyso.com:~ # radosgw-admin quota set --quota-scope=user --uid=57574cda626b45fba1cd96e68a57ced2 --max-size=32985348833280
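For reference, --max-size is given in bytes; 32985348833280 bytes corresponds to 30 TiB (30 × 1024⁴), which matches the max_size_kb value of 32212254720 shown below.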

verify the set quota max_size and max_size_kb
root@master.qa.cloud.clyso.com:~ # radosgw-admin user info --uid=57574cda626b45fba1cd96e68a57ced2
{
    "user_id": "57574cda626b45fba1cd96e68a57ced2",
    "display_name": "qa-clyso-backup",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "auid": 0,
    "subusers": [],
    "keys": [],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": 32985348833280,
        "max_size_kb": 32212254720,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "keystone"
}

enable quota for one specific user

root@master.qa.cloud.clyso.com:~ # radosgw-admin quota enable --quota-scope=user --uid=57574cda626b45fba1cd96e68a57ced2
root@master.qa.cloud.clyso.com:~ # radosgw-admin user info --uid=57574cda626b45fba1cd96e68a57ced2
{
    "user_id": "57574cda626b45fba1cd96e68a57ced2",
    "display_name": "qa-clyso-backup",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "auid": 0,
    "subusers": [],
    "keys": [],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": true,
        "check_on_raw": false,
        "max_size": 32985348833280,
        "max_size_kb": 32212254720,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "keystone"
}

synchronize stats for one specific user

root@master.qa.cloud.clyso.com:~ # radosgw-admin user stats --uid=57574cda626b45fba1cd96e68a57ced2 --sync-stats
{
    "stats": {
        "total_entries": 10404,
        "total_bytes": 54915680,
        "total_bytes_rounded": 94674944
    },
    "last_stats_sync": "2017-08-21 07:09:58.909073Z",
    "last_stats_update": "2017-08-21 07:09:58.906372Z"
}

Sources

https://docs.ceph.com/en/latest/radosgw/admin/

· 2 min read
Joachim Kraftmayer

Again and again I come across people who use a replication size of 2 for a replicated Ceph pool.

If you know exactly what you are doing, you can do this; otherwise I would strongly advise against it.

One simple reason is that you cannot form a clear majority with only two votes; there always has to be at least a third.

There are error scenarios in which it can quickly happen that both OSDs (osdA and osdB) of a placement group (replication size = 2) become unavailable. If osdA fails, the cluster only has one copy of the object left, and the default value (min_size = 2) on the pool means that the cluster no longer allows any write operations on the object.

With min_size = 1 (not recommended), osdB could go down for a short time while osdA comes back. osdA then does not know whether further write operations were carried out on osdB during its own offline phase.

You can then only hope that all OSDs come back, or you have to decide manually which copy holds the most current data, while in the background more and more blocked requests from clients that want to access the data accumulate in the cluster.
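If you want to check or adjust these settings on an existing pool, a minimal example (the pool name rbd is a placeholder):

ceph osd pool get rbd size        # number of replicas
ceph osd pool get rbd min_size    # minimum number of replicas required to serve I/O
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2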

· 3 min read
Joachim Kraftmayer

With the default options you will see blocked requests in the cluster caused by deep-scrubbing operations.

recommended deep scrub options to minimize the impact of scrub/deep-scrub on the ceph cluster

the following options define the scrub and deep-scrub behaviour via CPU load, OSD scheduler priority, defined check intervals and the Ceph cluster health state

[osd]
#reduce scrub impact
osd max scrubs = 1
osd scrub during recovery = false
osd scrub max interval = 4838400 # 56 days
osd scrub min interval = 2419200 # 28 days
osd deep scrub interval = 2419200
osd scrub interval randomize ratio = 1.0
# osd deep scrub randomize ratio = 1.0
osd scrub priority = 1
osd scrub chunk max = 1
osd scrub chunk min = 1
osd deep scrub stride = 1048576 # 1 MB
osd scrub load threshold = 5.0
osd scrub sleep = 0.3
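These settings can also be pushed to running OSDs without a restart; a hedged sketch using injectargs, picking only the values from the block above that you actually want to change:

ceph tell osd.* injectargs '--osd_scrub_sleep 0.3 --osd_scrub_chunk_max 1 --osd_scrub_chunk_min 1'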
osd max scrubs

Description: The maximum number of simultaneous scrub operations for a Ceph OSD Daemon.

Type: 32-bit Int

Default: 1
osd scrub during recovery

Description: Allow scrub during recovery. Setting this to false will disable scheduling new scrub (and deep-scrub) while there is active recovery. Already running scrubs will be continued. This might be useful to reduce load on busy clusters.

Type: Boolean

Default: true
osd scrub min interval

Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon when the Ceph Storage Cluster load is low.

Type: Float

Default: Once per day. 60*60*24

osd scrub interval randomize ratio

Description: Add a random delay to osd scrub min interval when scheduling the next scrub job for a placement group. The delay is a random value less than osd scrub min interval * osd scrub interval randomized ratio. So the default setting practically randomly spreads the scrubs out in the allowed time window of [1, 1.5] * osd scrub min interval.

Type: Float

Default: 0.5
osd scrub priority

Description: The priority set for scrub operations. It is relative to osd client op priority.

Type: 32-bit Integer

Default: 5

Valid Range: 1-63
osd scrub chunk max

Description: The maximum number of object store chunks to scrub during single operation.

Type: 32-bit Integer

Default: 25
osd scrub chunk min

Description: The minimal number of object store chunks to scrub during single operation. Ceph blocks writes to single chunk during scrub.


Default: 5
osd deep scrub stride

Description: Read size when doing a deep scrub.

Type: 32-bit Integer

Default: 512 KB. 524288
osd scrub load threshold

Description: The normalized maximum load. Ceph will not scrub when the system load (as defined by getloadavg() / number of online cpus) is higher than this number. Default is 0.5.

Type: Float

Default: 0.5
osd scrub sleep

Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow down whole scrub operation while client operations will be less impacted.

Type: Float

Default: 0

Sources

http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/

https://indico.cern.ch/event/588794/contributions/2374222/attachments/1383112/2103509/Configuring_Ceph.pdf

https://github.com/ceph/ceph/blob/master/src/common/options.cc#L3130

· One min read
Joachim Kraftmayer

You can delete buckets and their contents both with S3 tools and with Ceph's own built-in tools.

via S3 API

With the popular command line tool s3cmd, you can delete buckets with content via S3 API call as follows:

s3cmd rb --recursive s3://clyso_bucket

via radosgw-admin command

radosgw-admin talks directly to the Ceph cluster, does not require a running radosgw process, and is also the faster way to delete buckets and their contents from the Ceph cluster.

radosgw-admin bucket rm --bucket=clyso_bucket --purge-objects
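Before purging, it can be worth checking what the bucket contains; a hedged example using the same bucket name:

radosgw-admin bucket stats --bucket=clyso_bucket   # object count and total size before deletion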

If you want to delete an entire user and his or her data from the system, you can do so with the following command:

radosgw-admin user rm --uid=<username> --purge-data

Use this command wisely!