Emergency Support Request: OSD Crash on Ceph v20.2.0

Problem

On January 11, 2026, at 2:48 AM PST, an emergency support request was opened for OSD crashes that rendered CephFS inaccessible on Ceph v20.2.0.

The incident was resolved, restoring cluster availability.

A secondary post-recovery issue was subsequently identified and fixed.

Solution

The fix was to deploy a new Ceph build containing the required patches. The engineering team used the new build system, which made it possible to get the fix to the client as quickly as possible.

What happened

The client enabled allow_ec_optimizations on an existing CephFS Erasure Coded (EC) data pool. Crucially, the pool did not have allow_ec_overwrites explicitly enabled at the time the optimization flag was set.
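The flag combination described above can be checked for before an optimization is enabled. The following is a minimal sketch, assuming the pool list has been exported as JSON (for example with `ceph osd pool ls detail --format json`) and that each pool entry carries a `pool_name` and a comma-separated `flags_names` field. The flag spellings `ec_optimizations` and `ec_overwrites` are assumptions and should be verified against the release in use. The sample data is hypothetical.

```python
import json

# Flag names as they are assumed to appear in the "flags_names" field;
# verify these spellings against your Ceph release before relying on them.
EC_OPTIMIZATIONS = "ec_optimizations"
EC_OVERWRITES = "ec_overwrites"


def pools_at_risk(pool_detail_json: str) -> list[str]:
    """Return names of pools with EC optimizations set but overwrites unset."""
    risky = []
    for pool in json.loads(pool_detail_json):
        raw_flags = pool.get("flags_names", "")
        flags = {f.strip() for f in raw_flags.split(",") if f.strip()}
        if EC_OPTIMIZATIONS in flags and EC_OVERWRITES not in flags:
            risky.append(pool["pool_name"])
    return risky


# Hypothetical sample input, not taken from the affected cluster.
sample = json.dumps([
    {"pool_name": "cephfs_data_ec", "flags_names": "hashpspool,ec_optimizations"},
    {"pool_name": "rbd_ec", "flags_names": "hashpspool,ec_overwrites,ec_optimizations"},
    {"pool_name": "cephfs_metadata", "flags_names": "hashpspool"},
])

print(pools_at_risk(sample))  # ['cephfs_data_ec']
```

A check like this only reports the risky combination; it does not change any pool settings.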

With this combination of flags, the new code path (within ECTransaction::WritePlanObj) attempted to access a transaction key that did not exist, which crashed the OSDs.

See https://github.com/ceph/ceph/pull/66543/files for the code in question.

Because the OSDs were crashing and could not be brought back up, the engineering team identified two options:

Disable writes to the pool and access it in read-only mode (possibly migrating to a new pool).
Deploy a Ceph build that contains the fix.

The team is always cautious about shipping a custom Ceph build with an added patch, since such a build does not undergo the same extensive testing that upstream releases do.

In this scenario, since the fix was already merged into upstream and our engineers understood the impact of the patch, we decided to exercise our build system to deliver a fix for this client.

On January 11 at 5:07 AM, the client agreed to this solution, and at 6:45 AM the image was delivered to the client. The client first replaced a single OSD with the new image, which stopped the crash on that OSD. After all OSDs had been replaced with the patched image, all OSDs were up and the PGs were restored.
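During a rolling replacement like this one, it helps to confirm that no OSD is still running the old build. The sketch below assumes the JSON printed by `ceph versions`, which groups daemons by type and maps each version string to a daemon count. The version strings in the sample are hypothetical placeholders, and `expected_substring` is an illustrative parameter for whatever uniquely identifies the patched build.

```python
import json


def osd_versions(versions_json: str) -> dict[str, int]:
    """Return the {version string: daemon count} map for OSDs."""
    return json.loads(versions_json).get("osd", {})


def rollout_complete(versions_json: str, expected_substring: str) -> bool:
    """True when all OSDs report one version that contains expected_substring."""
    versions = osd_versions(versions_json)
    return len(versions) == 1 and all(expected_substring in v for v in versions)


# Hypothetical outputs: one taken mid-rollout, one after every OSD was replaced.
mid_rollout = json.dumps({"osd": {"stock-build": 30, "hotfix-build": 2}})
finished = json.dumps({"osd": {"hotfix-build": 32}})

print(rollout_complete(mid_rollout, "hotfix"))  # False
print(rollout_complete(finished, "hotfix"))     # True
```

If the patched build reports the same version string as the stock build, a different identifier (for example the container image of each daemon) would be needed instead.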

Tracker: https://tracker.ceph.com/issues/71642
Pull Request: https://github.com/ceph/ceph/pull/63960

New Scrub Errors

cluster:
    id:     c4f43f6e-f5c0-11ef-af25-a036bccd8b79
    health: HEALTH_WARN
            noout,noscrub,nodeep-scrub flag(s) set
            Too many repaired reads on 32 OSDs
            (muted: OSD_SCRUB_ERRORS POOL_NO_REDUNDANCY)

After the administrators removed the noscrub and nodeep-scrub flags, the OSDs began logging a flood of scrub-related errors:

osd.20 [ERR] 4.115 shard 4(1) soid 4:a8a7e711:::100027683d7.00000001:head : candidate size 1089536 info size 1085440 mismatch
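When many such lines are logged, it can be useful to extract the fields and the size difference programmatically. The following is a minimal sketch that parses the error line quoted above with a regular expression. The pattern is derived from that single line only, and the field names in the returned dictionary are illustrative.

```python
import re

# Pattern built from the one error line quoted in this report; other scrub
# messages have different shapes and will simply not match.
SIZE_MISMATCH = re.compile(
    r"osd\.(?P<osd>\d+) \[ERR\] (?P<pg>\S+) shard (?P<shard>\S+) "
    r"soid (?P<soid>\S+) : "
    r"candidate size (?P<candidate>\d+) info size (?P<info>\d+) mismatch"
)


def parse_size_mismatch(line: str):
    """Return the parsed fields of a size-mismatch line, or None if no match."""
    match = SIZE_MISMATCH.search(line)
    if match is None:
        return None
    candidate = int(match.group("candidate"))
    info = int(match.group("info"))
    return {
        "osd": int(match.group("osd")),
        "pg": match.group("pg"),
        "shard": match.group("shard"),
        "soid": match.group("soid"),
        "candidate_size": candidate,
        "info_size": info,
        "delta": candidate - info,
    }


line = (
    "osd.20 [ERR] 4.115 shard 4(1) soid "
    "4:a8a7e711:::100027683d7.00000001:head : "
    "candidate size 1089536 info size 1085440 mismatch"
)
print(parse_size_mismatch(line)["delta"])  # 4096
```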

Related Tracker: https://tracker.ceph.com/issues/73184
Pull Request: https://github.com/ceph/ceph/pull/65872
Related Tracker: https://tracker.ceph.com/issues/73260
Pull Request: https://github.com/ceph/ceph/pull/65788

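Each of these errors reports a size mismatch on a shard of an object: in the example above, the size seen during scrub (candidate size, 1089536 bytes) is 4096 bytes larger than the size recorded in the object info (info size, 1085440 bytes). The two patches listed above were identified as the likely fix and were added to a second build.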

The client then confirmed that everything was fine on the new build. There were no more CRC errors in logs, and write IOs went smoothly without any issues.

Second image: harbor.clyso.com/custom-ceph/ceph/ceph:v20.2.0-fast-ec-path-hf.2

This image includes all three PRs mentioned above.

Timeline

Date	Time	Event
2026 Jan 11	2:48 AM PST	Clyso engineering team makes first contact with the issue.
2026 Jan 11	4:38 AM PST	Clyso engineering team identifies the probable issue (see tracker) and proposes two solutions: migrate the data to a new pool, or compile a build that contains the fix.
2026 Jan 11	5:00 AM PST	Client agrees to a compiled fix and the build team begins a build.
2026 Jan 11	7:34 AM PST	Build is delivered to the client.
2026 Jan 11	9:14 AM PST	All OSDs in the client's cluster are up. PGs are coming back.
2026 Jan 11	9:42 AM PST	Post-incident observation: the client reports a high count of `OSD_SCRUB_ERRORS` (approximately 592,000 errors) following recovery.
2026 Jan 12	3:43 AM PST	Cluster reports `HEALTH_WARN` due to `too many repaired reads on 32 OSDs`. Whenever an OSD undergoes a scrub, the following errors are logged: `osd.20 [ERR] 4.115 shard 4(1) soid ... candidate size 1089536 info size 1085440 mismatch` `osd.20 [ERR] 4.115s0 shard 4(1) soid ... size 1089536 != size 1085440 from auth oi`
2026 Jan 12	6:43 AM PST	Clyso identifies two missing patches that are thought to contain the fix.
2026 Jan 12	9:36 AM PST	The second build is produced and sent to the client.
2026 Jan 12	11:52 AM PST	The client reports that OSDs are not triggering any new errors. Write IO is working again.
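The elapsed times between key events can be derived from the timeline with simple date arithmetic. The sketch below uses only timestamps from the table above, all in PST, so naive datetimes are sufficient. The event descriptions in the comments are paraphrases.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two timestamps given in the same time zone."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# First incident: first contact -> build delivered -> all OSDs up.
print(elapsed("2026-01-11 02:48", "2026-01-11 07:34"))  # 4:46:00
print(elapsed("2026-01-11 02:48", "2026-01-11 09:14"))  # 6:26:00

# Second incident: HEALTH_WARN reported -> second build sent -> client confirms.
print(elapsed("2026-01-12 03:43", "2026-01-12 09:36"))  # 5:53:00
print(elapsed("2026-01-12 03:43", "2026-01-12 11:52"))  # 8:09:00
```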
