Adding Capacity to Ceph -- the CLYSO Way!

Dan van der Ster

One of my favourite things to help users with is simplifying their workflows for major changes to their Ceph clusters, such as adding or removing multiple hosts at once. Ceph is inherently excellent at handling these tasks – one of its greatest strengths is the ability to transparently add or remove capacity, replace servers, and perform maintenance, all without downtime.

However, at large scales – when storage systems are pushing the boundaries of capacity or performance – additional tools are often needed to manage these operations in a more controlled and efficient way.

Let’s get into the details. Suppose you want to grow your cluster significantly. Whether you use ceph orch apply osd or custom tooling, after adding the new capacity, your ceph status might look like this:

4518 active+remapped+backfill_wait
2.1B (28%) objects misplaced

This indicates that 28% of the objects need to be redistributed onto the new hosts. What’s more concerning is the progress bar:

Progress: Global Recovery Event [x................................] (6w)

In this example – admittedly an extreme case – the operation is estimated to take 6 weeks to complete! During this time, operators have limited options to throttle, pause, revert, or abort the operation. What happens if a major unexpected event, like a power outage, occurs during that period? While Ceph will keep the data safe, the complexity and risk associated with such prolonged migrations are non-trivial.
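To put the 6-week figure in perspective, here is a rough back-of-envelope calculation using only the numbers from the example status above. It assumes the estimate reflects a constant backfill rate, which real clusters rarely sustain, so treat the result as an order of magnitude rather than a measurement.

```python
# Back-of-envelope: what sustained backfill rate does a 6-week ETA imply?
# Inputs are taken from the example `ceph status` output; the constant-rate
# assumption is ours and is only an approximation.

SECONDS_PER_WEEK = 7 * 24 * 3600

misplaced_objects = 2.1e9    # "2.1B objects misplaced"
misplaced_fraction = 0.28    # "(28%)"
eta_weeks = 6                # "(6w)" from the progress bar

# Total objects in the cluster implied by the misplaced percentage.
total_objects = misplaced_objects / misplaced_fraction

# Average objects per second that must move to finish within the ETA.
implied_rate = misplaced_objects / (eta_weeks * SECONDS_PER_WEEK)


def weeks_to_finish(objects: float, objects_per_second: float) -> float:
    """Weeks needed to move `objects` at a constant rate."""
    return objects / objects_per_second / SECONDS_PER_WEEK


print(f"total objects:  ~{total_objects:.2e}")
print(f"implied rate:   ~{implied_rate:.0f} objects/s")
print(f"at double rate: ~{weeks_to_finish(misplaced_objects, 2 * implied_rate):.1f} weeks")
```

Even if the backfill rate doubled, the cluster would still spend about three weeks in a remapped state, which is why controlling when and how data moves matters more than raw speed.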

This is why, at CLYSO, we strongly advocate for tools like upmap-remapped. This tool, which I originally developed during my time at CERN, has been a game-changer for Ceph maintenance. It provides enhanced control and reliability when making substantial changes to large clusters.
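The core idea can be illustrated with a short sketch. This is not the tool's actual code: it is a simplified illustration that assumes a replicated pool and a PG with no existing upmap entries, and the function names are hypothetical. For each remapped PG, it pairs the OSDs that CRUSH now wants (the "up" set) with the OSDs that currently hold the data (the "acting" set) and builds a ceph osd pg-upmap-items command that maps the PG back to where its data already lives, so the PG is no longer considered remapped.

```python
from typing import List, Optional, Tuple


def upmap_pairs(up: List[int], acting: List[int]) -> List[Tuple[int, int]]:
    """Pair each OSD found only in `up` with one found only in `acting`.

    Each (src, dst) pair means: where CRUSH chose OSD `src`, use `dst` instead.
    Simplification: order is ignored, which is only reasonable for replicated pools.
    """
    only_up = [osd for osd in up if osd not in acting]
    only_acting = [osd for osd in acting if osd not in up]
    return list(zip(only_up, only_acting))


def upmap_command(pgid: str, up: List[int], acting: List[int]) -> Optional[str]:
    """Build a `ceph osd pg-upmap-items` command, or None if nothing differs."""
    pairs = upmap_pairs(up, acting)
    if not pairs:
        return None
    args = " ".join(f"{src} {dst}" for src, dst in pairs)
    return f"ceph osd pg-upmap-items {pgid} {args}"


# Hypothetical PG: CRUSH now wants new OSD 120, but the data still sits on OSD 7.
print(upmap_command("1.2f", up=[120, 4, 9], acting=[7, 4, 9]))
```

With the PGs mapped back in this way, the cluster returns to a clean state almost immediately, and data can then be moved to the new capacity gradually (for example by the upmap balancer) rather than all at once.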

This week, we’ve published detailed documentation outlining our approach to adding capacity to large-scale Ceph clusters using this tool:

Improved Procedure for Adding Hosts or OSDs

We hope this resource proves valuable to the Ceph community. As always, if you need assistance with this or any other Ceph-related challenges, don’t hesitate to reach out to us at CLYSO!