12 posts tagged with "s3"

· 4 min read
Artem Torubarov

Today, we're excited to share that we've released the Chorus project under the Apache 2.0 License. In this blog post, let's talk about what Chorus is and why we made it.

At Clyso, we frequently assist our customers in migrating infrastructure, whether to or from the cloud, or between different cloud providers. Our focus often centers around storage, particularly S3.

Like many others in the field, we initially relied on the fantastic Rclone tool, which excelled at the task. However, when we ran into difficulties migrating a 100 TB bucket with 100M objects, we recognized the need for an additional layer of automation. Migrating large buckets within a reasonable timeframe requires a machine with substantial RAM and network bandwidth to take advantage of the parallelism options provided by Rclone.

Yet, even with powerful machines, the risk of network problems or VM restarts interrupting the synchronization process remained. While Rclone handles restarts admirably by comparing object size, ETag, and modification time, the process becomes time-consuming and incurs additional costs for cloud-based S3, especially with very large buckets.
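To illustrate why a restart is correct but still expensive, here is a minimal sketch of a per-object check based on the three attributes named above. It is not Rclone's actual algorithm; `ObjectInfo`, `needs_copy` and the `mtime_tolerance` parameter are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectInfo:
    """Metadata returned by a bucket listing for one object (hypothetical type)."""
    size: int
    etag: str
    mtime: float  # modification time, seconds since the epoch


def needs_copy(src: ObjectInfo, dst: Optional[ObjectInfo],
               mtime_tolerance: float = 1.0) -> bool:
    """Return True if the source object must be (re)copied to the destination."""
    if dst is None:
        return True  # object is missing on the destination
    if src.size != dst.size:
        return True
    if src.etag != dst.etag:
        return True
    # Treat timestamps as equal if they differ by at most the tolerance.
    return abs(src.mtime - dst.mtime) > mtime_tolerance


same = ObjectInfo(size=26, etag="abc", mtime=1000.0)
print(needs_copy(same, same))   # unchanged object, no copy needed
print(needs_copy(same, None))   # missing on the destination, copy needed
```

The check itself is cheap, but it needs the metadata of every object on both sides, so after an interruption both buckets have to be listed again in full. With 100M objects, that listing is where the time and the request costs go.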

The missing piece in our puzzle was the ability to run Rclone on multiple machines for better hardware utilization, together with the ability to track progress and store it on remote persistent storage. With these goals in mind, we developed Chorus - a vendor-agnostic S3 backup, replication, and routing software. Written in Go, Chorus uses Rclone for S3 object copying, Redis for progress tracking, and the Asynq work queue for load distribution across multiple machines.
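The idea of splitting the work into tasks and persisting progress can be sketched in a few lines. This is only an illustration and not Chorus code: the in-memory `done` set stands in for a persistent store such as Redis, and `make_tasks`, `run_pending` and `copy_batch` are hypothetical names.

```python
from typing import Callable, Dict, List, Set


def make_tasks(keys: List[str], batch_size: int) -> Dict[str, List[str]]:
    """Split an object listing into named batches that workers can process independently."""
    return {
        f"task-{i // batch_size}": keys[i:i + batch_size]
        for i in range(0, len(keys), batch_size)
    }


def run_pending(tasks: Dict[str, List[str]], done: Set[str],
                copy_batch: Callable[[List[str]], None]) -> int:
    """Run every task that is not yet marked done and return how many were executed."""
    executed = 0
    for task_id, batch in tasks.items():
        if task_id in done:
            continue  # finished before the restart, nothing to do
        copy_batch(batch)
        done.add(task_id)  # record progress only after the batch succeeded
        executed += 1
    return executed


tasks = make_tasks(["a", "b", "c", "d", "e"], batch_size=2)
done = {"task-0"}  # pretend the first batch finished before an interruption
print(run_pending(tasks, done, copy_batch=lambda batch: None))
```

Because progress is recorded per task, a restarted worker only repeats the batch that was in flight instead of comparing the whole bucket again.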

· One min read
Joachim Kraftmayer

The customer has used Commvault as a data protection solution for years and is now looking to replace the existing storage solution (EMC) behind it for all of its customer environments.

Commvault provides data backup through a single interface. Through the gradual deployment of Ceph S3 in several expansion stages, the customer built confidence in Ceph as a storage technology and more and more backups are gradually being transferred to the new backend.

In the first phase, Ceph S3 was able to demonstrate its performance and scalability.

In the following phases, the focus will be on flexibility and use as unified storage for cloud computing and Kubernetes.

For all these scenarios, the customer relies on Ceph as an extremely scalable, high-performance and cost-effective storage backend.

Ceph S3 easily handles over 1 PB of backup data and more than 500 GBytes per hour of backup throughput, and it is ready to grow further with future requirements.

After in-depth consultation, we were able to exceed the customer’s expectations for the Ceph cluster in production.

· One min read
Joachim Kraftmayer

The customer uses Commvault as a data backup solution for all of their customer environments.

Wherever the data resides, Commvault provides the backup of the data through a single interface. The customer thus avoids costly data loss scenarios, disconnected data silos, lack of recovery SLAs and inefficient scaling.

For all these scenarios, the customer relies on Ceph as a powerful and cost-effective storage backend for Commvault.

Ceph easily handles over 2 PB of backup data and more than 1 TByte per hour of backup throughput, and it is ready to grow further with future requirements.

In conclusion, we clearly exceeded the customer’s expectations of the Ceph cluster as early as the test phase.

· 2 min read
Joachim Kraftmayer

motivation

We have tested Ceph S3 in OpenStack Swift intensively before. This time we were interested in the behavior of the radosgw stack in Ceph. We paid particular attention to how the size and number of objects affect the resource consumption of the radosgw process. The effects on the response latencies of radosgw were also important to us, so that we can plan the right sizing of the physical and virtual environments.

technical topics

From a technical point of view, we were interested in the behavior of radosgw in the following topics.

  • dynamic bucket sharding
  • http frontend difference between Civetweb and Beast
  • index pool io pattern and latencies
  • data pool io pattern and latencies with erasure-coded and replicated pools
  • fast_read vs. standard read for workloads with large and small objects.

requirements

When choosing the right tool, it was important for us to be able to test both small and large Ceph clusters with several thousand OSDs.

We want the test results both as a file for evaluation and as a graphical representation of time series data.

For time series data we rely on the standard stack of Grafana, Prometheus and Thanos.

The main Prometheus exporters we use are ceph-mgr-exporter and node-exporter.

load and performance tools

CBT - The Ceph Benchmarking Tool

CBT is a testing harness written in Python.

https://github.com/ceph/cbt

s3-tests

This is a set of unofficial Amazon AWS S3 compatibility tests

https://github.com/ceph/s3-tests

COSBench - Cloud Object Storage Benchmark

COSBench is a benchmarking tool to measure the performance of Cloud Object Storage services.

https://github.com/intel-cloud/cosbench

Gosbench

Gosbench is the Golang reimplementation of COSBench. It is a distributed S3 performance benchmark tool with a Prometheus exporter that uses the official Golang AWS SDK.

https://github.com/mulbc/gosbench

hsbench

hsbench is an S3-compatible benchmark originally based on wasabi-tech/s3-benchmark.

https://github.com/markhpc/hsbench

Warp

Warp is the S3 benchmarking tool from MinIO.

https://github.com/minio/warp

the tool of our choice

getput

getput can be run individually on a test client.

gpsuite is responsible for synchronization and scaling across any number of test clients. Communication takes place via ssh keys and the simultaneous start of all s3 test clients is synchronized over a common time base.
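Starting all clients against a common time base can be pictured as follows. This is a minimal sketch and not gpsuite's implementation: it assumes the clocks of all test clients are synchronized (for example via NTP), and `seconds_until` and `wait_for_start` are hypothetical helper names.

```python
import time
from typing import Callable


def seconds_until(start_epoch: float, now: float) -> float:
    """Remaining delay until the agreed start time, never negative."""
    return max(0.0, start_epoch - now)


def wait_for_start(start_epoch: float,
                   clock: Callable[[], float] = time.time,
                   sleep: Callable[[float], None] = time.sleep) -> float:
    """Block until the agreed start time and return the delay that was waited."""
    delay = seconds_until(start_epoch, clock())
    if delay > 0:
        sleep(delay)
    return delay


# The coordinator picks one start time slightly in the future and sends the
# same value to every client; each client then waits locally before testing.
start_at = time.time() + 0.2
waited = wait_for_start(start_at)
print(f"waited {waited:.2f}s, starting the S3 test now")
```

Distributing an absolute start time instead of a "go" signal means the different SSH connection latencies to the clients do not shift the start of the individual tests.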

Installation on Linux is supported as a script or as a container.

https://github.com/markseger/getput

· One min read
Joachim Kraftmayer

There are several testing frameworks for OpenStack Swift, but most of them are no longer maintained or have other problems that prevent them from testing larger Ceph clusters properly under load.

At this point we decided to use getput by Mark Seger.

Test Setup

Ceph Cluster

Our Ceph test cluster consisted of 108 Ceph nodes with a total of 2592 OSDs. The network was a spine-leaf architecture.

Openstack Compute Nodes

12 compute nodes with 56 HT cores and 100 GBit/s network connectivity.

Test setup

gpsuite

Sources:

github.com/markseger/getput

· 3 min read
Joachim Kraftmayer

We encountered the first large omap objects in one of our Luminous Ceph clusters in Q3 2018 and worked with a couple of Ceph Core developers on the solution for internal management of RadosGW objects. This included topics such as large omap objects, dynamic resharding, multisite, deleting old object instances in the RadosGW index pool, and many small changes that were included in the Luminous, Mimic, and subsequent versions.

Here is a step-by-step guide on how to identify large omap objects and the affected buckets, and then manually reshard the affected objects.

output ceph status

ceph -s

cluster:
id: 52296cfd-d6c6-3129-bf70-db16f0e4423d
health: HEALTH_WARN
1 large omap object

output ceph health detail

ceph health detail
HEALTH_WARN 1 large omap objects
1 large objects found in pool 'clyso-test-sin-1.rgw.buckets.index'
Search the cluster log for 'Large omap object found' for more details.

Search the ceph.log of the Ceph cluster:

2018-09-26 12:10:38.440682 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77104 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
2018-09-26 12:10:35.037753 osd.1262 osd.1262 192.168.130.31:6836/10060 152 : cluster [WRN] Large omap object found. Object: 28:18428495:::.dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376:head Key count: 2928429 Size (bytes): 861141085
2018-09-26 13:00:00.000103 mon.clyso1-mon1 mon.0 192.168.130.20:6789/0 77505 : cluster [WRN] overall HEALTH_WARN 1 large omap objects

From the ceph.log we extract the bucket instance, in this case:

143112fc-1178-40e1-b209-b859cd2c264c.38511450.376 and look for it in the RadosGW metadata
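On clusters with many warnings, this extraction can be scripted. The following is a sketch under assumptions: the regular expression is derived only from the log line shown above, and `parse_large_omap` and `LargeOmap` are hypothetical names. For an index that is already sharded, the object name may carry an additional shard suffix after the bucket instance, which this sketch does not strip.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the "Large omap object found" line shown above.
LARGE_OMAP_RE = re.compile(
    r"Large omap object found\. Object: \S*?:::\.dir\.(?P<instance>\S+?):head"
    r"\s+Key count: (?P<keys>\d+)\s+Size \(bytes\): (?P<size>\d+)"
)


class LargeOmap(NamedTuple):
    bucket_instance: str
    key_count: int
    size_bytes: int


def parse_large_omap(line: str) -> Optional[LargeOmap]:
    """Extract bucket instance, key count and size from a ceph.log warning line."""
    match = LARGE_OMAP_RE.search(line)
    if match is None:
        return None
    return LargeOmap(match.group("instance"),
                     int(match.group("keys")),
                     int(match.group("size")))


line = (
    "2018-09-26 12:10:35.037753 osd.1262 osd.1262 192.168.130.31:6836/10060 152 : "
    "cluster [WRN] Large omap object found. Object: "
    "28:18428495:::.dir.143112fc-1178-40e1-b209-b859cd2c264c.38511450.376:head "
    "Key count: 2928429 Size (bytes): 861141085"
)
print(parse_large_omap(line))
```

Lines that do not contain the warning return `None`, so the function can be applied to every line of ceph.log.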

root@salt-master1.clyso.test:~ # radosgw-admin metadata list "bucket.instance" | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
"b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376"
root@salt-master1.clyso.test:~ #

The instance exists, so we check its metadata.

root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
{
"key": "bucket.instance:b1868d6d-9d61-49b0-b101-c89207009b16:143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"ver": {
"tag": "_Ehz5PYLhHBxpsJ_s39lePnX",
"ver": 7
},
"mtime": "2018-04-24 10:02:32.362129Z",
"data": {
"bucket_info": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"creation_time": "2018-02-20 20:58:51.125791Z",
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"flags": 0,
"zonegroup": "1c44aba5-fe64-4db3-9ef7-f0eb30bf5d80",
"placement_rule": "default-placement",
"has_instance_obj": "true",
"quota": {
"enabled": true,
"check_on_raw": true,
"max_size": 54975581388800,
"max_size_kb": 53687091200,
"max_objects": -1
},
"num_shards": 0,
"bi_shard_hash_type": 0,
"requester_pays": "false",
"has_website": "false",
"swift_versioning": "false",
"swift_ver_location": "",
"index_type": 0,
"mdsearch_config": [],
"reshard_status": 0,
"new_bucket_instance_id": ""
},
"attrs": [
{
"key": "user.rgw.acl",
"val": "AgK4A.....AAAAAAA="
},
{
"key": "user.rgw.idtag",
"val": ""
},
{
"key": "user.rgw.x-amz-read",
"val": "aW52YWxpZAA="
},
{
"key": "user.rgw.x-amz-write",
"val": "aW52YWxpZAA="
}
]
}
}
root@salt-master1.clyso.test:~ #

Get the metadata of the bucket:

root@salt-master1.clyso.test:~ # radosgw-admin metadata get bucket:b1868d6d-9d61-49b0-b101-c89207009b16
{
"key": "bucket:b1868d6d-9d61-49b0-b101-c89207009b16",
"ver": {
"tag": "_WaSWh9mb21kEjHCisSzhWs8",
"ver": 1
},
"mtime": "2018-02-20 20:58:51.152766Z",
"data": {
"bucket": {
"name": "b1868d6d-9d61-49b0-b101-c89207009b16",
"marker": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"bucket_id": "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"owner": "d7a84e1aed9144919f8893b7d6fc5b02",
"creation_time": "2018-02-20 20:58:51.125791Z",
"linked": "true",
"has_bucket_info": "false"
}
}
root@salt-master1.clyso.test:~ #

Grep for the bucket_id in the RadosGW index pool:

root@salt-master1.clyso.test:~ # rados -p eu-de-200-1.rgw.buckets.index ls | egrep "143112fc-1178-40e1-b209-b859cd2c264c.38511450.376" | wc -l
1
root@salt-master1.clyso.test:~ #

The bucket index RADOS object that has to be resharded:

143112fc-1178-40e1-b209-b859cd2c264c.38511450.376
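For the manual reshard, a target shard count is needed. The sketch below derives one from the key count reported in ceph.log. It assumes a target of 100,000 index entries per shard, which matches the default of `rgw_max_objs_per_shard` as far as we know; check the value for your release. `suggested_shards` and `reshard_command` are hypothetical helper names, and the command is only printed, not executed.

```python
import math


def suggested_shards(key_count: int, max_objs_per_shard: int = 100_000) -> int:
    """Smallest shard count that keeps every shard at or below the target size."""
    return max(1, math.ceil(key_count / max_objs_per_shard))


def reshard_command(bucket: str, num_shards: int) -> str:
    """Build the radosgw-admin call for a manual reshard (returned as text only)."""
    return f"radosgw-admin bucket reshard --bucket={bucket} --num-shards={num_shards}"


key_count = 2928429  # "Key count" from the ceph.log line above
shards = suggested_shards(key_count)
print(shards)
print(reshard_command("b1868d6d-9d61-49b0-b101-c89207009b16", shards))
```

With 2,928,429 keys and 100,000 entries per shard, the result is 30 shards. Leaving headroom for growth by choosing a higher number is a judgment call for the operator.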

· One min read
Joachim Kraftmayer

A quick reference in case you need the syntax of the radosgw-admin object stat command:

clyso-ceph-rgw-client:~/clyso # radosgw-admin object stat --bucket=size-container --object=clysofile


{
"name": "clysofile",
"size": 26,
"policy": {
"acl": {
"acl_user_map": [
{
"user": "clyso-user",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "clyso-user",
"grant": {
"type": {
"type": 0
},
"id": "clyso-user",
"email": "",
"permission": {
"flags": 15
},
"name": "clyso-admin",
"group": 0,
"url_spec": ""
}
}
]
},
"owner": {
"id": "clyso-user",
"display_name": "clyso-admin"
}
},
"etag": "clyso-user",
"tag": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11729649.143382",
"manifest": {
"objs": [],
"obj_size": 26,
"explicit_objs": "false",
"head_size": 26,
"max_head_size": 4194304,
"prefix": ".ZQzVc6phBAMCv3lSbiHBo0fftkpXmjm_",
"rules": [
{
"key": 0,
"val": {
"start_part_num": 0,
"start_ofs": 4194304,
"part_size": 0,
"stripe_max_size": 4194304,
"override_prefix": ""
}
}
],
"tail_instance": "",
"tail_placement": {
"bucket": {
"name": "size-container",
"marker": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"bucket_id": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11750341.561",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"placement_rule": "default-placement"
}
},
"attrs": {
"user.rgw.pg_ver": "��",
"user.rgw.source_zone": "eR[�\u0011",
"user.rgw.tail_tag": "d667b6f1-5737-4f5e-bad0-fc030f0a4e94.11729649.143382",
"user.rgw.x-amz-meta-mtime": "1535100720.157102"
}
}

· One min read
Joachim Kraftmayer

incomplete state

The Ceph cluster has recognized that a placement group (PG) is missing important information. Either information about write operations that may have occurred is missing, or there are no error-free copies.

The recommendation is to bring all OSDs that are in the down or out state back into the Ceph cluster, as these could contain the required information. In the case of an Erasure Coding (EC) pool, temporarily reducing min_size can enable recovery. However, min_size cannot be smaller than the number of data chunks defined for this pool.
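The min_size constraint can be expressed as a small check. This is a sketch and not Ceph code: `k` is the number of data chunks and `m` the number of coding chunks of the EC profile, and `validate_min_size` is a hypothetical helper name. The comment about the usual default of k + 1 is an assumption that should be verified for your release.

```python
def validate_min_size(k: int, m: int, min_size: int) -> int:
    """Return min_size if it is valid for an EC pool with k data and m coding chunks."""
    if k < 1 or m < 1:
        raise ValueError("k and m must both be at least 1")
    if min_size < k:
        # Fewer than k chunks cannot reconstruct the data.
        raise ValueError(f"min_size {min_size} is below the {k} data chunks")
    if min_size > k + m:
        raise ValueError(f"min_size {min_size} exceeds the {k + m} total chunks")
    return min_size


# Example profile k=4, m=2: the pool usually runs with min_size = k + 1 = 5
# (assumed default); for recovery it can be lowered temporarily to k = 4.
print(validate_min_size(k=4, m=2, min_size=4))
```

Once recovery has completed, min_size should be set back to its previous value, because running at k leaves no redundancy for writes accepted in that state.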

Sources

https://docs.ceph.com/docs/master/rados/operations/pg-states/

https://docs.ceph.com/docs/master/rados/operations/erasure-code/