OSD Removal Stuck in Zap Loop

Issue Overview

When removing an OSD with the --replace and --zap flags using ceph orch osd rm, the removal process can become stuck in an infinite loop if the underlying ceph-volume lvm zap command fails. This occurs when the OSD ID is present in the CRUSH map but has no associated Ceph volume, which can happen when a drive is removed before it has been cleaned up on the cephadm side.

Symptoms

The OSD removal status shows the OSD in a done, waiting for purge state:

ceph orch osd rm status

Output:

OSD  HOST             STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
5    tot-p-ceph-0036  done, waiting for purge  0    True     False  True

The OSD appears as destroyed in the OSD tree:

5   nvme     13.97260                  osd.5             destroyed         0  1.00000

Cluster logs show repeating messages every few seconds indicating the zap operation is being attempted repeatedly:

$ ceph log last cephadm
2026-03-26T16:46:19.426379+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323365 : cephadm [INF] osd.5 now down
2026-03-26T16:46:19.427843+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323366 : cephadm [INF] Daemon osd.5 on tot-p-ceph-0036 was already removed
2026-03-26T16:46:19.428247+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323367 : cephadm [INF] Successfully destroyed old osd.5 on tot-p-ceph-0036; ready for replacement
2026-03-26T16:46:19.428270+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323368 : cephadm [INF] Zapping devices for osd.5 on tot-p-ceph-0036
2026-03-26T16:46:31.273497+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323392 : cephadm [INF] osd.5 now down
2026-03-26T16:46:31.274646+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323393 : cephadm [INF] Daemon osd.5 on tot-p-ceph-0036 was already removed
2026-03-26T16:46:31.274952+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323394 : cephadm [INF] Successfully destroyed old osd.5 on tot-p-ceph-0036; ready for replacement
2026-03-26T16:46:31.274983+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323395 : cephadm [INF] Zapping devices for osd.5 on tot-p-ceph-0036
2026-03-26T16:46:37.914310+0000 mgr.tot-p-ceph-0011.fjhcni (mgr.404669891) 323409 : cephadm [INF] osd.5 now down

Root Cause

The issue occurs when:

  1. An OSD is removed using the command:

    ceph orch osd rm 5 --replace --zap

  2. The OSD ID has no ceph-volume attached to it

  3. During the --zap path, the command ceph-volume lvm zap --osd-id 5 fails

  4. The removal process does not check the return value of the do_zap() function

  5. Without this error handling, cephadm assumes the zap can still succeed and retries it indefinitely, creating an infinite loop

This issue is tracked in the Ceph issue tracker. Additionally, cephadm itself gets stuck and cannot proceed with other operations until this removal issue is resolved.
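The failure mode can be illustrated with a minimal sketch. This is hypothetical code, not actual cephadm source: a removal loop that discards the zap result never drains its queue, while one that checks the result surfaces the failure instead of retrying forever.

```python
# Hypothetical sketch of the removal loop; not actual cephadm code.

def do_zap(osd_id):
    """Stand-in for 'ceph-volume lvm zap --osd-id <id>'.

    Returns False, mirroring the case described above where the OSD ID
    has no attached ceph-volume and the zap always fails."""
    return False

def process_removal_queue_buggy(queue, max_iterations=5):
    """Discards the do_zap() result, so a failing zap is retried forever.

    max_iterations exists only so this sketch terminates; the real loop
    has no such bound."""
    attempts = 0
    while queue and attempts < max_iterations:
        osd_id = queue[0]
        do_zap(osd_id)   # result discarded: the bug
        attempts += 1    # the entry never leaves the queue
    return attempts

def process_removal_queue_fixed(queue):
    """Checks the result and fails the removal instead of looping."""
    failed = []
    while queue:
        osd_id = queue.pop(0)
        if not do_zap(osd_id):
            failed.append(osd_id)  # surface the error, do not retry
    return failed
```

With a single stuck OSD, the buggy loop runs until its artificial bound while the fixed loop drains the queue and reports the failure once.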

Impact

When this issue occurs:

  • The OSD removal operation remains stuck in the queue
  • Cephadm operations may be blocked or degraded
  • The cluster continues to attempt zapping operations that cannot succeed
  • Other OSD management operations may be affected

Workaround

Since the OSD is already in a destroyed state and the daemon has been removed, you can manually complete the removal process.

Option 1: Edit the Removal Queue (Recommended)

This method avoids rebalancing by directly editing the removal queue.

Step 1: Check Current Removal Status

Verify the OSD is stuck in removal:

ceph orch osd rm status

Output:

OSD  HOST             STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
5    tot-p-ceph-0036  done, waiting for purge  0    True     False  True

Step 2: View the Removal Queue

Check the current state of the removal queue:

ceph config-key get mgr/cephadm/osd_remove_queue

Output:

[{"osd_id": 5, "started": true, "draining": false, "stopped": false, "replace": true, "force": false, "zap": true, "hostname": "tot-p-ceph-0036", "original_weight": 13.97259521484375, "drain_started_at": null, "drain_stopped_at": null, "drain_done_at": "2026-03-26T16:44:09.812623Z", "process_started_at": "2026-03-26T16:44:03.823270Z"}]

Step 3: Clear the Removal Queue

Set the removal queue to empty:

ceph config-key set mgr/cephadm/osd_remove_queue []
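Note that setting the key to [] drops every pending removal, not just the stuck one. If other OSDs are queued, a safer variant is to filter only the stuck entry out of the JSON before writing it back. The helper below is an illustrative sketch (the function name is my own); it operates on the same JSON shown in Step 2, and the result is what you would pass to ceph config-key set.

```python
import json

def filter_removal_queue(raw_json, osd_id):
    """Return the removal-queue JSON with the entry for osd_id removed.

    raw_json is the output of:
        ceph config-key get mgr/cephadm/osd_remove_queue
    and the returned string is what you would pass to:
        ceph config-key set mgr/cephadm/osd_remove_queue '<result>'
    """
    queue = json.loads(raw_json)
    remaining = [entry for entry in queue if entry["osd_id"] != osd_id]
    return json.dumps(remaining)
```

After writing the filtered queue back, continue with the manager failover in Step 4 as usual.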

Step 4: Restart the Manager

Fail over to a new manager to apply the changes:

ceph mgr fail

Step 5: Verify Removal Queue is Clear

Confirm the OSD removal operation has been cleared:

ceph orch osd rm status

Expected output:

No OSD remove/replace operations reported

Option 2: Stop the Removal Operation (Not Recommended)

Warning: Using ceph orch osd rm stop <osd_id> will mark the OSD as IN and cause misplaced objects, triggering cluster rebalancing. This is why Option 1 (editing the removal queue) is preferred.

If you must use this method:

ceph orch osd rm stop 5

This will stop the removal operation but will likely cause unnecessary data movement in the cluster.

Option 3: Manual Device Cleanup (If Needed)

If the devices still need to be zapped after completing Option 1, use the steps below.

Step 1: Verify OSD State

Check the current state of the OSD:

ceph osd tree | grep osd.5
ceph orch ps | grep osd.5

Then clean up devices if necessary:

# Identify the devices used by OSD 5
ceph-volume lvm list
lsblk

# Manually zap the devices (replace with actual device paths)
ceph-volume lvm zap /dev/sdX --destroy
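To find the device paths belonging to a given OSD ID, ceph-volume lvm list supports JSON output (--format json), which is keyed by OSD ID with each entry carrying a devices list. The helper below is an illustrative sketch for parsing that output; the exact field names are an assumption based on recent releases, so verify against your version's output.

```python
import json

def devices_for_osd(lvm_list_json, osd_id):
    """Map an OSD ID to its underlying device paths.

    lvm_list_json is the output of 'ceph-volume lvm list --format json'.
    Assumed structure: a dict keyed by OSD ID (as a string), where each
    value is a list of LV entries that include a 'devices' list.
    """
    data = json.loads(lvm_list_json)
    devices = set()
    for lv in data.get(str(osd_id), []):
        devices.update(lv.get("devices", []))
    return sorted(devices)
```

Each returned path can then be passed to ceph-volume lvm zap <device> --destroy as shown above.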

Verification

After completing the workaround, verify the OSD has been completely removed and the cluster is healthy:

ceph osd tree
ceph orch osd rm status
ceph health

Prevention

To avoid this issue in future OSD removals:

  • Verify that OSDs have properly attached ceph-volumes before attempting removal
  • Check for any underlying ceph-volume issues before initiating removal
  • Monitor the removal process and intervene early if loops are detected
  • If an OSD removal gets stuck, use the removal queue editing method (Option 1) to avoid unnecessary rebalancing
  • Avoid using ceph orch osd rm stop as it causes the OSD to be marked IN and triggers data movement

This is a known issue where the cephadm removal process does not properly handle do_zap() function failures. The lack of return code checking causes the process to assume success and continue retrying indefinitely.

Notes

  • The OSD being in destroyed state indicates the logical removal was successful
  • The stuck loop is specifically related to the zap operation failure
  • Manual intervention is safe when the OSD is already destroyed and removed from the cluster
  • This issue highlights the importance of proper error handling in the OSD removal workflow