Overview
Hardware prices are soaring, putting immense pressure on IT budgets to do more with less. Managing your storage efficiently is now a financial necessity. In this article, we look at the different possibilities and solutions available to reduce the data footprint in Ceph, allowing you to cut down on physical hardware requirements without sacrificing data reliability.
Consumer-grade hardware vs. Enterprise-grade hardware
-
Q: Can we reduce hardware costs by using consumer-grade drives instead of enterprise-grade hardware in our Ceph cluster?
A: While consumer-grade drives (such as desktop or gaming SSDs/HDDs) offer an attractive upfront price point, we strongly discourage using them in production Ceph environments. Consumer-grade hardware is simply not designed for the rigorous, 24/7 write-intensive workloads typical of a distributed storage cluster. Here is why choosing enterprise-grade hardware is critical for long-term cost optimization:
-
Performance Degradation: Consumer drives lack the sustained write performance and deep queue depths required by Ceph. Under continuous load, they suffer from severe latency spikes and throttling, which can degrade the performance of your entire cluster.
-
Data Durability & Power Loss Protection (PLP): Enterprise drives include capacitors to protect data in flight during a sudden power outage. Without PLP, consumer drives can suffer from data corruption or loss, undermining Ceph’s consistency guarantees.
-
Low Write Endurance: Ceph generates significant write amplification. Consumer drives have much lower Terabytes Written (TBW) ratings, meaning they will wear out and fail rapidly, ultimately driving up replacement costs and administrative overhead.
The Bottom Line: Cutting corners on drive quality leads to unpredictable performance, higher failure rates, and potential data loss. True cost optimization in Ceph comes from tuning software configurations (like compression and caching) on reliable, enterprise-grade hardware—not from compromising the physical layer.
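To make the endurance argument concrete, here is a rough back-of-the-envelope sketch. All numbers in it (TBW ratings, daily write volume, write amplification factor) are hypothetical placeholders rather than measurements; substitute the figures from your drive datasheets and your own cluster.

```python
def lifetime_years(tbw_rating_tb, client_writes_tb_per_day, write_amplification):
    """Estimate years until a drive's TBW rating is exhausted.

    tbw_rating_tb: endurance rating from the datasheet, in TB written.
    client_writes_tb_per_day: client data landing on this drive per day.
    write_amplification: assumed multiplier for the extra writes Ceph
        generates (hypothetical value, measure it on your cluster).
    """
    drive_writes_per_day = client_writes_tb_per_day * write_amplification
    return tbw_rating_tb / drive_writes_per_day / 365


# Hypothetical drives: 600 TBW (consumer) vs 3500 TBW (enterprise).
for label, tbw in (("consumer", 600), ("enterprise", 3500)):
    years = lifetime_years(tbw, client_writes_tb_per_day=0.5, write_amplification=3)
    print(f"{label}: about {years:.1f} years")
```

With these assumed inputs, the lower-rated drive reaches its endurance limit several times sooner, which is where the replacement costs come from.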
Replication Factor Reduction
-
Q: What are the risks of configuring new Ceph pools with RF2 (Replication Factor 2) or RF1 (Replication Factor 1) for RBD and CephFS pools?
A: Clyso does not support a replication factor of 2 (RF2). Historical data shows a high risk of data loss when using a replication factor of less than 3 (for example, size=2, min_size=1), often resulting from concurrent drive failures, degraded states caused by flapping OSDs, or human error during maintenance operations. RF1 keeps only a single copy of the data, so the same risks apply to it even more strongly.
Erasure Coding
-
Q: What workload types are a good or bad fit for EC?
A: EC is ideal for large datasets characterized by write-once/read-occasional access or high-throughput streaming. This aligns perfectly with S3 object storage workloads involving immutable data, such as photos, videos, and backups. Conversely, EC should be avoided for workloads requiring high random IOPS, such as small-file logging or database operations on RBD and CephFS.
Note that EC (erasure coding) optimizations, also referred to as "FastEC", are arriving in Ceph Tentacle and will help RBD and CephFS use cases over EC. Here is the technical document describing these optimizations: https://docs.ceph.com/en/latest/dev/osd_internals/erasure_coding/enhancements and a blog article: https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates. However, as of mid-2026, we do not recommend enabling FastEC in production.
-
Q: What is the expected performance impact of EC (erasure coding) compared to replicated pools, specifically around read/write latency and IOPS? Are there workloads (e.g. random small I/O vs large sequential) where the degradation is significant enough to be a concern?
A: As of mid-2026 (without Tentacle's FastEC), the performance drop of EC compared to RF3 is quite high (at least 30%) for most RBD and CephFS workloads. It is even worse with databases or transactional VMs, where the IOPS loss is closer to 50–70% and latency increases by a factor of 2× to 4×. If your workload is based on small I/Os and expects high IOPS and low latency, stick with RF3.
In general, for RBD- and CephFS-based workloads, we recommend staying with RF3. However, you can have both RF3 and EC data pools for RBD/CephFS and choose one or the other based on the type of data and access pattern (DB engines, backup software, etc.). For RBD workloads, images holding large files can be provisioned in an EC pool instead of an RF3 pool. For CephFS, parts of the filesystem (subfolders) can be linked to the EC pool using file layouts to achieve the same result.
For S3 storage, you can use S3 storage classes together with Lua scripts and lifecycle policies for dynamic placement and optimized retention: https://github.com/FredNass/s3-dynamic-placement-and-archiving
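EC remains attractive despite the performance cost because of its lower raw-capacity overhead. The sketch below compares the raw capacity needed for the same usable capacity under RF3 and under an EC profile. The 4+2 profile and the 100 TB figure are illustrative assumptions, not recommendations from this article.

```python
def raw_tb_replicated(usable_tb, size):
    """Raw capacity needed when every object is stored `size` times."""
    return usable_tb * size


def raw_tb_ec(usable_tb, k, m):
    """Raw capacity needed with k data chunks and m coding chunks."""
    return usable_tb * (k + m) / k


usable = 100  # TB of user data (hypothetical)
rf3 = raw_tb_replicated(usable, size=3)
ec42 = raw_tb_ec(usable, k=4, m=2)  # assumed example profile

print(f"RF3:    {rf3:.0f} TB raw")
print(f"EC 4+2: {ec42:.0f} TB raw")
print(f"Raw capacity saved: {1 - ec42 / rf3:.0%}")
```

These figures ignore free-space headroom and metadata, so treat them as a first approximation only.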
Bluestore-compression-related content
-
Q: Are there any risks or performance implications when you enable aggressive compression at the pool level?
A: No. Enabling aggressive compression at the pool level is the only way to ensure that data is compressed regardless of whether clients provide compression hints, and that existing data gets compressed as well (see below). Using aggressive compression is OK.
-
Q: What is the CPU overhead of enabling compression at the OSD level?
A: The answer depends on how your cluster is set up. The overhead depends heavily on the compression_mode, the compressibility of the data, the required compression ratios that are set (such as bluestore_compression_required_ratio and/or compression_required_ratio), and the I/O sizes and patterns.
-
Q: Is there a risk of compression negating itself if tenants like Kafka are already sending compressed data to Ceph?
A: Yes, there is. However, thanks to bluestore_compression_required_ratio, data that does not compress well enough is stored uncompressed:

    # ceph config help bluestore_compression_required_ratio
    bluestore_compression_required_ratio - Compression ratio required to store compressed data
      (float, advanced)
      Default: 0.875000
      Can update at runtime: true

      If we compress data and get less than this, we discard the result and store the original uncompressed data.

Note that you can also set compression_required_ratio on a per-pool basis (rather than setting bluestore_compression_required_ratio at the OSD level). https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#inline-compression
-
Q: After enabling Bluestore compression, is there a way to compress existing data?
A: Yes, you can trigger compression on existing data by forcing PG migration. The simplest method is to mark OSDs out and then back in, one at a time. This forces the data to be compressed as it is rewritten to new OSDs.
-
Q: After enabling BlueStore compression, ceph df detail output shows a perfect 2x compression ratio for all data. Is it expected?
A: As of May 2026, the BlueStore compression ratio is limited to 2x, even though standard algorithms like snappy can achieve better results. This is because bluestore_compression_min_blob_size_hdd is set to 8K by default in Ceph. You can find more information and a permanent fix in PR#67421. Until that fix is merged, you can achieve a better compression ratio by increasing this value (e.g., to 64K) and restarting the OSDs. In the near future, restarting the OSDs will no longer be required, thanks to PR#67433.
-
Q: Should I enable BlueStore compression on specific pools, or on many pools?
A: Prefer enabling BlueStore compression on specific pools only.
-
Q: What are the limits of the BlueStore compression ratio?
A: As of May 2026, the BlueStore compression ratio is limited to 2x, even though standard algorithms like snappy can achieve much better results. This is because bluestore_compression_min_blob_size_hdd is set to 8K by default in Ceph. You can find more information and a permanent fix in PR#67241. Until that fix is merged, you can achieve a better compression ratio by increasing this value (e.g., to 64K) and restarting the OSDs. In the near future, restarting the OSDs will no longer be required, thanks to PR#67433.
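The 2x ceiling can be illustrated with a simplified model of how a compressed blob is stored. This is a sketch, not BlueStore's actual code: it assumes a 4 KiB allocation unit (the article does not state this value) and that a compressed blob is rounded up to whole allocation units and kept only if it meets the required ratio of 0.875 quoted earlier.

```python
import math


def stored_size(blob_len, compressed_len, alloc_unit=4096, required_ratio=0.875):
    """Bytes allocated for one blob under the simplified rule.

    alloc_unit is an assumption (4 KiB); required_ratio mirrors the
    bluestore_compression_required_ratio default quoted in the text.
    """
    def round_up(n):
        return math.ceil(n / alloc_unit) * alloc_unit

    result_len = round_up(compressed_len)
    want_len = round_up(blob_len * required_ratio)
    if result_len <= want_len and result_len < blob_len:
        return result_len  # compressed result is kept
    return blob_len        # result discarded, raw data stored


def best_ratio(blob_len, alloc_unit=4096):
    """Best achievable ratio: the blob shrinks to one allocation unit."""
    return blob_len / stored_size(blob_len, 1, alloc_unit)


print(best_ratio(8 * 1024))   # 8K blobs
print(best_ratio(64 * 1024))  # 64K blobs
```

Under these assumptions an 8K blob can never occupy less than one 4 KiB unit, so its ratio tops out at 2x, while a 64K blob leaves room for much higher ratios.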
RGW-compression-related Questions
-
Q: Is it better to enable compression at the RGW level or at the Bluestore (OSD) level?
A: Compressing at the RGW level is preferred over BlueStore for several key reasons:
- Larger datasets allow for much better compression efficiency.
- Transferring compressed data to the OSDs saves network bandwidth and maximizes throughput to the drives.
- Generating fewer RADOS objects speeds up recovery and backfilling tasks.
- Doing compression on the RGW compresses the data only once. BlueStore compresses the data multiple times (once for each OSD involved).
To compress new data and existing data, you can refer to Frédéric's article and talk on Dynamic Placement and Optimized Retention.
-
Q: How can I enable RGW compression?
A: You can configure RGW compression per storage class. For more information, please refer to the upstream documentation, Frédéric's article and talk on Dynamic Placement and Optimized Retention, or Bryan's Squishing Squid webinar.
-
Q: What compression algorithm and level would you recommend using?
A: It depends on your workload. zstd is a good all-around choice that achieves good compression ratios at good speeds. LZ4, on the other hand, is preferred when you want the fastest compression and decompression speeds, at the cost of lower compression ratios. For a comparison of these algorithms, see Bryan's Squishing Squids talk.
Furthermore, Ceph allows you to deploy multiple storage classes with different compression algorithms, giving you the flexibility to dynamically assign the optimal algorithm for each workload, bucket, tenant, object size or type, etc.
-
Q: Can RGW compression use hardware offloading?
A: Yes, RGW compression can use hardware offloading via Intel QuickAssist Technology (QAT) or NVIDIA nvCOMP. Starting in Tentacle, Ceph also supports ARM64 CPUs from Huawei (for example, HiSilicon Kunpeng 920) using UADK.
-
Q: After enabling RGW compression on a storage pool, is there a way to compress existing data?
A: Yes (eventually). By using Lifecycle Policies, users can transition data to another storage class that has compression enabled. As the data moves to the new storage class, it will be compressed. This will be supported only once upstream pull request 67725 has been merged and a release has been built with it, so make sure your Ceph release includes it before relying on this behavior.
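As a rough illustration, a lifecycle configuration that transitions objects to another storage class can be expressed in the standard S3 lifecycle format. In this sketch the storage class name COMPRESSED, the rule ID, and the one-day delay are hypothetical; use the storage class you actually defined in your zone placement, and keep the release caveat above in mind.

```python
import json


def transition_rule(storage_class, days=1, prefix=""):
    """Build one S3 lifecycle rule that moves objects to `storage_class`."""
    return {
        "ID": f"transition-to-{storage_class.lower()}",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix matches every object
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


# "COMPRESSED" is an assumed storage class name, not a Ceph built-in.
lifecycle = {"Rules": [transition_rule("COMPRESSED")]}
print(json.dumps(lifecycle, indent=2))
```

The resulting document has the shape that S3 clients accept for a bucket lifecycle configuration (for example, the LifecycleConfiguration argument of boto3's put_bucket_lifecycle_configuration).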
-
Q: Can I force new uploads to land in the compressed storage class?
A: Yes, you can use Lua for that, and you can do so based on the object size, tenant, or bucket name, etc. You can refer to the Dynamic Placement and Optimized Retention article for more information.
RAM Recommendations
-
Q: How much RAM should be allocated per OSD to optimize cost and performance?
A: The ideal RAM allocation should be determined by monitoring the actual usage of the OSD metadata cache (onode cache), as it largely depends on your workload. You can analyze this by running ceph tell osd.x perf dump on OSDs with significant uptime. (This command returns a comprehensive JSON blob detailing live performance metrics, including IOPS, byte throughput, latency, and queue depths.) To ensure optimal OSD performance, the cache hit ratio should ideally be above 85% to 90%. If you achieve this target with an osd_memory_target of 8 GB of RAM, there is no need to allocate 16 GB. Monitoring this metric provides valuable insights for sizing the memory of your future hardware deployments, allowing you to right-size your servers and effectively limit infrastructure costs.
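The hit ratio can be derived from the perf dump JSON along the following lines. This is a minimal sketch: the counter names (onode_hits and onode_misses under the bluestore section, with a bluestore_-prefixed fallback) are assumptions that may differ between Ceph releases, so check them against your own output. The sample values are made up.

```python
import json


def onode_hit_ratio(perf_dump):
    """Return onode cache hits / (hits + misses), or None if no lookups.

    Counter names are assumed; verify them in your own `perf dump` output.
    """
    bs = perf_dump.get("bluestore", {})
    hits = bs.get("onode_hits", bs.get("bluestore_onode_hits", 0))
    misses = bs.get("onode_misses", bs.get("bluestore_onode_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Made-up sample standing in for the output of `ceph tell osd.x perf dump`.
sample = json.loads('{"bluestore": {"onode_hits": 9200, "onode_misses": 800}}')
ratio = onode_hit_ratio(sample)
print(f"onode cache hit ratio: {ratio:.1%}")
```

Because the counters are cumulative since the OSD started, comparing two dumps taken some time apart gives a better picture of the current workload than a single reading.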
RGW-deduplication-related Questions
-
Q: What's the current status of S3 object deduplication in Ceph?
A: S3 object deduplication was added to Ceph Tentacle. Initially, Ceph will allow users to evaluate the benefits of deduplication using specific extensions to the radosgw-admin command (e.g., dedup estimate, dedup stats). In a later phase, Ceph will introduce background deduplication processing. Deduplication can be managed per bucket and will apply to both compressed and versioned objects. For more information, see the public documentation on Full RGW Object Dedup. Associated PRs: 62179, 68860, 68965
General thoughts
-
Q: Are there any other ideas or recommendations that might help improve clusters that have capacity constraints?
A: Clyso's general recommendation is to prioritize adding storage capacity first. If that isn't feasible, use compression selectively for specific use cases involving large files and compressible data. If you have the bandwidth, Clyso encourages you to test Tentacle and FastEC in a lab environment; ideally, you could reproduce a specific workload using a representative sample of your existing data to see how it performs. Ensure that you do not use Tentacle or FastEC in production yet, as they are not quite ready (as of May 2026).