
RGW Segmentation Fault During Multi-Object Delete: OLH Corruption and Lifecycle Worker Race Condition

Issue Overview

RGW (RADOS Gateway) crashes are occurring in a cluster running Ceph version 17.2.7 (Quincy). The crashes are triggered by a segmentation fault during multi-object delete operations combined with garbage collection activity.

NOTE: This issue has been fixed in Ceph Squid (v19); all releases from 19.1.0 onward contain the fix.

Key Error Details

OLH Update Failures (Error -125)

ERROR: could not apply olh update, r=-125
ERROR: could not clear bucket index olh entries r=-125

  • Error Code: -125 = ECANCELED (Operation canceled)
  • Issue: The OLH update was canceled, typically because a competing write modified the bucket index entry between read and update

Garbage Collection Errors (Error -28)

garbage collection: RGWGC::send_split_chain - send chain returned error: -28
  • Error Code: -28 = ENOSPC (No space left on device)
  • Issue: The garbage collection queue has run out of space
  • Related Upstream Tracker: https://tracker.ceph.com/issues/66457 (note: this tracker does not include a segmentation fault)
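
Both return codes are negated Linux errno values, so they can be decoded with the standard errno table. A minimal Python sketch (illustrative only; `decode_rgw_rc` is a hypothetical helper, not part of RGW):

```python
import errno
import os

def decode_rgw_rc(rc):
    """Map a negated errno return code from an RGW log line to its name and message."""
    err = -rc  # RGW logs negative errno values, e.g. r=-125
    name = errno.errorcode.get(err, "UNKNOWN")
    return f"{rc} = {name} ({os.strerror(err)})"

print(decode_rgw_rc(-125))  # ECANCELED: the OLH update was canceled
print(decode_rgw_rc(-28))   # ENOSPC: the GC queue reports no space
```

The mapping assumes Linux errno numbering, which is what Ceph daemons log.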

Crash Details

  • Crash Type: Segmentation fault in thread radosgw
  • Crash Location: std::_Rb_tree_insert_and_rebalance during red-black tree operations within the rgw_filter_attrset function
  • Timing: Errors and crash occur within seconds of each other (14:06:48 - 14:06:49 UTC on Sept 3, 2024)

Stack Trace Path:

  1. Multi-object delete operation (RGWDeleteMultiObj::handle_individual_object)
  2. Object state retrieval (get_obj_state)
  3. Raw object stat (RGWRados::raw_obj_stat)
  4. Attribute filtering (rgw_filter_attrset)
  5. Tree rebalancing operation (crash point)
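
rgw_filter_attrset copies selected attributes from one sorted map into another; a crash inside std::_Rb_tree_insert_and_rebalance is characteristic of that map being mutated by another thread mid-copy. The C++ behavior is undefined, but the hazard can be illustrated in Python, where the runtime detects it deterministically (an analogy only; this sketch is not the rgw_filter_attrset implementation):

```python
# Toy analogue of rgw_filter_attrset: copy attrs matching a prefix into a new map.
def filter_attrset(attrs, prefix="user.rgw."):
    return {k: v for k, v in attrs.items() if k.startswith(prefix)}

attrs = {"user.rgw.acl": b"...", "user.rgw.etag": b"...", "other.meta": b"..."}
print(filter_attrset(attrs))  # safe: no concurrent mutation

# If another thread (e.g. a lifecycle worker) mutates attrs mid-iteration,
# the traversal is invalidated. Python raises; a C++ std::map corrupts silently
# and typically crashes later inside tree rebalancing.
try:
    for k in attrs:
        attrs["user.rgw.olh.info"] = b"..."  # simulated concurrent insert
except RuntimeError as exc:
    print("iteration invalidated:", exc)
```

The attribute names are made up for illustration; the point is that unsynchronized insertion during traversal is exactly the failure mode a rebalance-time segfault suggests.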

Root Cause Analysis

Initial Theory: The OLH objects may be missing or corrupted. This corruption could be preventing proper cleanup during garbage collection, causing the GC queue to run out of space and ultimately leading to the segmentation fault.

Chain of Events:

  1. OLH objects become corrupted or go missing
  2. Multi-object delete operations fail to update OLH references
  3. Failed deletions prevent proper garbage collection
  4. GC queue fills up (ENOSPC)
  5. System attempts to handle the corrupted state, triggering segmentation fault
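
Step 4 can be pictured as a bounded queue that stops accepting work once its capacity is reached. A toy sketch (GCQueue, its capacity, and this send_split_chain are inventions for illustration, not RGW internals):

```python
from collections import deque

ENOSPC = 28

class GCQueue:
    """Toy bounded garbage-collection queue; not the real RGW implementation."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def send_split_chain(self, chain):
        # Mirrors the logged failure: once the queue is full, enqueueing
        # a new deferred-delete chain fails with -ENOSPC (-28).
        if len(self.entries) >= self.capacity:
            return -ENOSPC
        self.entries.append(chain)
        return 0

    def drain_one(self):
        # Normal GC would drain entries; stalled cleanup leaves this side idle.
        return self.entries.popleft() if self.entries else None

q = GCQueue(capacity=3)
results = [q.send_split_chain(f"chain-{i}") for i in range(4)]
print(results)  # → [0, 0, 0, -28]
```

If nothing drains the queue, every subsequent enqueue keeps failing with -28, matching the repeated send_split_chain errors in the logs.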

Segmentation Fault Analysis

Similar Upstream Issue

The stack trace closely resembles the one documented in this GitHub PR discussion, which also occurred during get_obj_state: https://github.com/ceph/ceph/pull/55657#discussion_r1566457876

This similarity suggests that lifecycle worker threads may be involved, performing concurrent updates to bucket->handle(attrs) while the multi-object delete path reads the same state.

Potential Workarounds

Option 1: Disable Lifecycle Threads on Affected RGW Daemons

Set rgw_enable_lc_threads = false on the specific RGW daemon(s) experiencing crashes. When this setting is disabled, the lifecycle worker count settings (rgw_lc_max_worker and rgw_lc_max_wp_worker) are not relevant for that daemon.

# Disable lifecycle threads on specific RGW daemon
ceph config set client.rgw.<daemon-id> rgw_enable_lc_threads false

# Or disable for all RGW daemons
ceph config set client.rgw rgw_enable_lc_threads false

# Restart the affected RGW daemon(s) for the change to take effect
ceph orch restart rgw.<service-name>

Option 2: Limit Lifecycle Worker Concurrency

For RGW daemons that need to continue performing lifecycle (LC) work, consider setting both lifecycle worker parameters to 1:

# Set lifecycle worker limits on specific RGW daemon
ceph config set client.rgw.<daemon-id> rgw_lc_max_worker 1
ceph config set client.rgw.<daemon-id> rgw_lc_max_wp_worker 1

# Or set for all RGW daemons
ceph config set client.rgw rgw_lc_max_worker 1
ceph config set client.rgw rgw_lc_max_wp_worker 1

# Restart the affected RGW daemon(s) for the change to take effect
ceph orch restart rgw.<service-name>

Verify Configuration:

# Check current configuration for a specific daemon
ceph config show client.rgw.<daemon-id> | grep -E 'rgw_enable_lc_threads|rgw_lc_max_worker|rgw_lc_max_wp_worker'

# Check configuration for all RGW daemons
ceph config dump | grep -E 'rgw_enable_lc_threads|rgw_lc_max_worker|rgw_lc_max_wp_worker'

Performance Considerations:

  • Limiting worker counts to 1 may have a performance impact in some situations
  • This is a diagnostic step to determine if lifecycle worker concurrency is the root cause
  • If the segmentation faults stop occurring with limited concurrency, it confirms the race condition theory
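
The intent of dropping both settings to 1 is to serialize lifecycle work so no two workers touch shared state concurrently. A hypothetical Python sketch of the idea (the attrs dict and lc_task stand in for RGW state and lifecycle work; this is not RGW code):

```python
from concurrent.futures import ThreadPoolExecutor

attrs = {}

def lc_task(i):
    # Each task mutates the shared map; with a single worker these run
    # strictly in sequence, so the mutation race window is closed by design.
    attrs[f"user.rgw.lc.{i}"] = i
    return i

# max_workers=1 is the analogue of rgw_lc_max_worker = 1
with ThreadPoolExecutor(max_workers=1) as pool:
    done = list(pool.map(lc_task, range(8)))

print(len(attrs), done)  # all 8 updates applied, in order
```

This trades throughput for determinism, which is exactly the diagnostic trade-off described above.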

The following PR may address this issue: https://github.com/ceph/ceph/pull/56712

This fix should be evaluated to determine if it resolves the segmentation fault related to concurrent attribute updates.

Additional Context

  • Request showing successful operation just before crash: req=0x7f9b89be1710 completed with op status=0 http_status=200
  • Client: xx.xxx.xxx.xxx performing operations on bucket xxx-prod-xxxx-fs01
  • The combination of multi-object delete operations, garbage collection errors, OLH update failures, and lifecycle worker thread activity suggests a race condition during concurrent attribute updates on versioned objects