RGW Segmentation Fault During Multi-Object Delete: OLH Corruption and Lifecycle Worker Race Condition
Issue Overview
RGW (RADOS Gateway) crashes are occurring in a cluster running Ceph version 17.2.7 (Quincy). The crashes are triggered by a segmentation fault during multi-object delete operations combined with garbage collection activity.
NOTE: This issue has been fixed in Ceph Squid (v19); all releases from 19.1.0 onward contain the fix.
Key Error Details
OLH Update Failures (Error -125)
ERROR: could not apply olh update, r=-125
ERROR: could not clear bucket index olh entries r=-125
- Error Code: -125 = ECANCELED
- Affected Component: OLH (Object Logical Head) - objects that RGW creates to point to the most recent version of a given RGW object
- Impact: RadosGW cannot update the OLH object or the references to OLH objects in the bucket index during multi-object delete operations
- Related Upstream Tracker: https://tracker.ceph.com/issues/66089
- Reference Documentation: https://docs.ceph.com/en/reef/dev/radosgw/bucket_index/
Garbage Collection Errors (Error -28)
garbage collection: RGWGC::send_split_chain - send chain returned error: -28
- Error Code: -28 = ENOSPC (No space left on device)
- Issue: The garbage collection queue has run out of space
- Related Upstream Tracker: https://tracker.ceph.com/issues/66457 (note: this tracker does not include a segmentation fault)
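When triaging these logs, the negative return codes RGW prints can be decoded with Python's standard errno module (values shown are the Linux errno numbers):

```python
import errno
import os

# RGW logs negative errno values; negate them to look up the symbolic name.
for rc in (-125, -28):
    name = errno.errorcode[-rc]        # e.g. 125 -> 'ECANCELED'
    print(rc, name, os.strerror(-rc))  # human-readable description
```

This confirms the mapping used above: -125 is ECANCELED and -28 is ENOSPC.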
Crash Details
- Crash Type: Segmentation fault in a radosgw thread
- Crash Location: std::_Rb_tree_insert_and_rebalance during red-black tree operations within the rgw_filter_attrset function
- Timing: Errors and crash occur within seconds of each other (14:06:48 - 14:06:49 UTC on Sept 3, 2024)
Stack Trace Path:
- Multi-object delete operation (RGWDeleteMultiObj::handle_individual_object)
- Object state retrieval (get_obj_state)
- Raw object stat (RGWRados::raw_obj_stat)
- Attribute filtering (rgw_filter_attrset)
- Tree rebalancing operation (crash point)
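For context on the crash point: a simplified Python sketch of what a prefix-based attribute filter like rgw_filter_attrset conceptually does is shown below. This is an analogy, not the RGW implementation; the real function copies entries between C++ std::map instances (red-black trees), and the segfault surfaces during the rebalance step of such an insertion.

```python
def filter_attrset(attrs: dict, prefix: str) -> dict:
    """Copy only the attributes whose names start with the given prefix.

    Conceptual analogue of RGW's rgw_filter_attrset: the real code inserts
    matching entries into a std::map, and the crash occurs while that
    red-black tree is being rebalanced mid-insert.
    """
    return {name: value for name, value in attrs.items()
            if name.startswith(prefix)}

# Hypothetical attribute names, for illustration only.
attrs = {
    "user.rgw.acl": b"...",
    "user.rgw.etag": b"...",
    "user.other": b"...",
}
print(filter_attrset(attrs, "user.rgw."))  # keeps only the user.rgw.* entries
```

If another thread mutates the source map while this copy is in flight, the tree can be left in an inconsistent state, which matches the crash location above.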
Root Cause Analysis
Initial Theory: The OLH objects may be missing or corrupted. This corruption could be preventing proper cleanup during garbage collection, causing the GC queue to run out of space and ultimately leading to the segmentation fault.
Chain of Events:
- OLH objects become corrupted or go missing
- Multi-object delete operations fail to update OLH references
- Failed deletions prevent proper garbage collection
- GC queue fills up (ENOSPC)
- System attempts to handle the corrupted state, triggering segmentation fault
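Step 4 of the chain can be pictured with a bounded queue: once it is full, further enqueue attempts are refused immediately, analogous to RGWGC::send_split_chain returning -28 (ENOSPC). This is an analogy in Python, not RGW code:

```python
import queue

# A bounded queue standing in for the RGW garbage-collection queue.
gc_queue = queue.Queue(maxsize=2)
gc_queue.put_nowait("chain-1")
gc_queue.put_nowait("chain-2")

try:
    gc_queue.put_nowait("chain-3")  # queue already full: enqueue is refused
except queue.Full:
    print("enqueue failed: GC queue full (analogous to ENOSPC / -28)")
```

When deletions keep failing, chains are never drained from the queue, so every subsequent enqueue fails the same way.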
Segmentation Fault Analysis
Similar Upstream Issue
The stack trace closely resembles the one documented in this GitHub PR
discussion covering get_obj_state:
https://github.com/ceph/ceph/pull/55657#discussion_r1566457876
This similarity suggests that lifecycle worker threads may be racing on
concurrent updates to the bucket handle's attrs.
Potential Workarounds
Option 1: Disable Lifecycle Threads on Affected RGW Daemons
Set rgw_enable_lc_threads = false on the specific RGW daemon(s) experiencing
crashes. When this setting is disabled, the lifecycle worker count settings
(rgw_lc_max_worker and rgw_lc_max_wp_worker) are not relevant for that
daemon.
# Disable lifecycle threads on specific RGW daemon
ceph config set client.rgw.<daemon-id> rgw_enable_lc_threads false
# Or disable for all RGW daemons
ceph config set client.rgw rgw_enable_lc_threads false
# Restart the affected RGW daemon(s) for the change to take effect
ceph orch restart rgw.<service-name>
Option 2: Limit Lifecycle Worker Concurrency
For RGW daemons that need to continue performing lifecycle (LC) work, consider setting both lifecycle worker parameters to 1:
# Set lifecycle worker limits on specific RGW daemon
ceph config set client.rgw.<daemon-id> rgw_lc_max_worker 1
ceph config set client.rgw.<daemon-id> rgw_lc_max_wp_worker 1
# Or set for all RGW daemons
ceph config set client.rgw rgw_lc_max_worker 1
ceph config set client.rgw rgw_lc_max_wp_worker 1
# Restart the affected RGW daemon(s) for the change to take effect
ceph orch restart rgw.<service-name>
Verify Configuration:
# Check current configuration for a specific daemon
ceph config show client.rgw.<daemon-id> | grep -E 'rgw_enable_lc_threads|rgw_lc_max_worker|rgw_lc_max_wp_worker'
# Check configuration for all RGW daemons
ceph config dump | grep -E 'rgw_enable_lc_threads|rgw_lc_max_worker|rgw_lc_max_wp_worker'
Performance Considerations:
- Limiting worker counts to 1 may have a performance impact in some situations
- This is a diagnostic step to determine if lifecycle worker concurrency is the root cause
- If the segmentation faults stop occurring with limited concurrency, it confirms the race condition theory
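Why limiting concurrency helps can be illustrated with a toy model (not RGW code): many lifecycle "workers" mutating one shared attrs map, serialized so only one mutator runs at a time, which is the effect of dropping the worker count to 1. Serialization closes the concurrent-mutation window the race-condition theory blames.

```python
import threading

attrs_lock = threading.Lock()
shared_attrs = {}

def lc_worker(worker_id: int) -> None:
    # Each toy worker writes 1000 unique keys into the shared map.
    for i in range(1000):
        with attrs_lock:  # one mutator at a time, like a single LC worker
            shared_attrs[f"w{worker_id}-{i}"] = i

threads = [threading.Thread(target=lc_worker, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared_attrs))  # 4000: no concurrent mutation of the map
```

In RGW the shared structure is a C++ std::map, where unsynchronized concurrent mutation is undefined behavior rather than a lost update, hence the segfault.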
Potentially Related Fix
The following PR may address this issue: https://github.com/ceph/ceph/pull/56712
This fix should be evaluated to determine if it resolves the segmentation fault related to concurrent attribute updates.
Additional Context
- Request showing a successful operation just before the crash: req=0x7f9b89be1710 completed with op status=0 http_status=200
- Client: xx.xxx.xxx.xxx performing operations on bucket xxx-prod-xxxx-fs01
- The combination of multi-object delete operations, garbage collection errors, OLH update failures, and lifecycle worker thread activity suggests a race condition during concurrent attribute updates on versioned objects