Post Mortem: Tentacle v20.2.0 OSD crashing due to EC Bug

January 13, 2026 · 6 min read

Software Engineer at Clyso

Zac Dover

Technical Writer at Clyso

On January 11, 2026, at 2:48 AM PST, an emergency support request was opened for OSD crashes in v20.2.0 that rendered CephFS inaccessible.

The incident was resolved, restoring cluster availability.

A secondary post-recovery issue related to scrubbing errors was subsequently identified and fixed.

The fix involved deploying a new build of Ceph that contained the patches for the bugs. The engineering team made use of Clyso's new build system, delivering the fix to the client as fast as possible.

Initial Status

9 of 42 OSDs crashed across multiple failure domains, leaving 102 inactive PGs and 1 inactive CephFS file system:

  cluster:
    id:     <redacted>
    health: HEALTH_ERR
            9 failed cephadm daemon(s)
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            7/172378665 objects unfound (0.000%)
            noout,nobackfill,norebalance,norecover flag(s) set
            9 osds down
            340449 scrub errors
            Reduced data availability: 102 pgs inactive, 86 pgs down, 17 pgs stale
            Possible data damage: 27 pgs inconsistent
            Degraded data redundancy: 59884881/517135781 objects degraded (11.580%), 393 pgs degraded, 430 pgs undersized
             
  services:
    mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 15h) [leader: ceph-1]
    mgr: ceph-2.ghijk(active, since 15h), standbys: ceph-1.abcdef
    mds: 1/1 daemons up, 1 standby
    osd: 42 osds: 33 up (since 14h), 42 in (since 18h); 111 remapped pgs
         flags noout,nobackfill,norebalance,norecover
          
  data:
    volumes: 0/1 healthy, 1 stopped
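
To put the numbers in the status output above in proportion, here is a minimal Python sketch that pulls the OSD counts and the degraded-object ratio out of two lines copied from that output. The function name `summarize` and the regular expressions are illustrative assumptions, not part of any Ceph tooling.

```python
import re

# Two lines copied from the `ceph status` output quoted above.
STATUS = """\
osd: 42 osds: 33 up (since 14h), 42 in (since 18h); 111 remapped pgs
Degraded data redundancy: 59884881/517135781 objects degraded (11.580%), 393 pgs degraded, 430 pgs undersized
"""


def summarize(status: str) -> dict:
    """Extract OSD counts and the degraded-object ratio from status text."""
    osd = re.search(r"(\d+) osds: (\d+) up", status)
    deg = re.search(r"(\d+)/(\d+) objects degraded", status)
    if osd is None or deg is None:
        raise ValueError("status text does not contain the expected lines")
    total, up = int(osd.group(1)), int(osd.group(2))
    degraded, objects = int(deg.group(1)), int(deg.group(2))
    return {
        "osds_down": total - up,
        "osds_down_pct": 100 * (total - up) / total,
        "degraded_pct": 100 * degraded / objects,
    }


summary = summarize(STATUS)
print(summary)
```

Roughly one in five OSDs was down, while about 11.6% of object copies were degraded.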

What happened

The client upgraded from Squid to Tentacle and enabled allow_ec_optimizations on an existing CephFS erasure-coded (EC) data pool.

The new code path (specifically within ECTransaction::WritePlanObj) attempted to access a transaction key that did not exist and caused random OSDs to crash.

Specifically: Github Code Link

Due to persistent OSD crashes and failed recovery attempts, the engineering team evaluated two options:

Disable writes to the pool and access it in r/o mode (possibly migrating data to a new pool)
Build a Ceph release that fixed the bug that led to the crashing OSDs

The team is always cautious about shipping a custom Ceph build with a backported patch, since such a build doesn't undergo the same extensive testing that upstream releases do.

In this scenario, since the fix was already merged into upstream and our engineers understood the impact of the patch, we decided to exercise our build system to deliver a fix for this client.

On January 11 at 5:07 AM, the client agreed to this solution, and at 6:45 AM the image was delivered to the client. The client first applied the new image to a single OSD, which started and operated without crashing. After the image was rolled out to all remaining OSDs, every OSD was up and running and all PGs recovered.
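
The post does not record which commands the client ran. As a rough sketch, a canary-first rollout on a cephadm-managed cluster could be planned as below. The image name and the canary daemon `osd.0` are hypothetical placeholders, and the script only prints the commands rather than executing them. Note that `ceph orch upgrade start` moves all daemons to the image, not only OSDs.

```python
from typing import Iterator


def rollout_commands(image: str, canary: str) -> Iterator[list[str]]:
    """Yield cephadm commands: redeploy one canary OSD, then upgrade the rest."""
    # Step 1: redeploy a single OSD daemon with the patched image.
    yield ["ceph", "orch", "daemon", "redeploy", canary, "--image", image]
    # Step 2: only after the canary has stayed up, move the cluster to the image.
    yield ["ceph", "orch", "upgrade", "start", "--image", image]


# Hypothetical placeholders, not values taken from the incident.
IMAGE = "registry.example.com/ceph/ceph:v20.2.0-hotfix"
plan = list(rollout_commands(IMAGE, canary="osd.0"))

for argv in plan:
    # Dry run: print each command instead of executing it.
    print(" ".join(argv))
```

Keeping the plan as data makes it easy to review before anything touches the cluster. The pause between the two steps is a manual check that the canary OSD stays up.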

Tracker: https://tracker.ceph.com/issues/74128
Pull Request: https://github.com/ceph/ceph/pull/66542

Intermediate Status

All OSDs were up and running and the PGs had recovered, but the cluster reported many scrub errors:

  cluster:
    id:     <redacted>
    health: HEALTH_WARN
            noout,noscrub,nodeep-scrub flag(s) set
            Too many repaired reads on 32 OSDs
            (muted: OSD_SCRUB_ERRORS)
 
  services:
    mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 39h) [leader: ceph-1]
    mgr: ceph-2.ghijk(active, since 40h), standbys: ceph-1.abcdef
    mds: 1/1 daemons up, 1 standby
    osd: 42 osds: 42 up (since 19h), 42 in (since 43h)
         flags noout,noscrub,nodeep-scrub
 
  data:
    volumes: 1/1 healthy

After the administrators removed the noscrub and nodeep-scrub flags, the OSDs began logging a flood of scrub-related errors:

osd.20 [ERR] 4.115 shard 4(1) soid 4:a8a7e711:::100027683d7.00000001:head : candidate size 1089536 info size 1085440 mismatch

As quoted from PR #65872:

The problem arises with "fast EC" for a pool which has been upgraded from a legacy EC pool. Any object that was written with legacy EC will be padded according to the legacy EC code, but scrubbing will check against the fast EC code, which pads shards more efficiently.
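
The size of the discrepancy can be read directly off the scrub error quoted earlier. The following Python sketch parses that log line and computes how many extra bytes the on-disk shard holds compared with the size scrub expects. The regular expression and the helper name `parse_size_mismatch` are assumptions made for illustration; the post does not state the pool's chunk size, so the sketch reports only the raw difference.

```python
import re

# Scrub error copied from the OSD log quoted in this post.
LOG_LINE = (
    "osd.20 [ERR] 4.115 shard 4(1) soid 4:a8a7e711:::100027683d7.00000001:head : "
    "candidate size 1089536 info size 1085440 mismatch"
)

PATTERN = re.compile(
    r"^(?P<osd>osd\.\d+) \[ERR\] (?P<pg>\S+) shard (?P<shard>\S+) .*"
    r"candidate size (?P<candidate>\d+) info size (?P<info>\d+) mismatch$"
)


def parse_size_mismatch(line: str) -> dict:
    """Return the OSD, PG, shard, and byte difference from a size-mismatch error."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError("not a size-mismatch scrub error")
    candidate, info = int(m["candidate"]), int(m["info"])
    return {
        "osd": m["osd"],
        "pg": m["pg"],
        "shard": m["shard"],
        "extra_bytes": candidate - info,
    }


result = parse_size_mismatch(LOG_LINE)
print(result)
```

For this line the shard is 4096 bytes (4 KiB) larger than expected, which is consistent with a difference in padding rather than with corrupted data.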

Clyso engineers identified the following PRs, both already merged upstream, as the missing fixes for this scrubbing issue:

Tracker	Pull Request
#73184	PR #65872
#73260	PR #65788

The client then confirmed that everything was working on the new build: there were no more CRC errors in the logs, and write IO proceeded without issues.

Post-Recovery Status

  cluster:
    id:     <redacted>
    health: HEALTH_OK
            (muted: MDS_CLIENT_RECALL)
 
  services:
    mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 2d) [leader: ceph-1]
    mgr: ceph-2.ghijk(active, since 2d), standbys: ceph-1.abcdef
    mds: 1/1 daemons up, 1 standby
    osd: 42 osds: 42 up (since 13h), 42 in (since 2d)
 
  data:
    volumes: 1/1 healthy

Final image: harbor.clyso.com/custom-ceph/ceph/ceph:v20.2.0-fast-ec-path-hf.2

This image contains v20.2.0 + PR #66542 + PR #65872 + PR #65788.

This image comes without any warranty or guarantee. For more information please contact Clyso Support.

Timeline

Date	Time	Event
2026 Jan 11	2:48 AM PST	Clyso engineering team makes first contact with the issue.
2026 Jan 11	4:38 AM PST	Clyso engineering team identifies the possible issue (see tracker) and proposes two solutions: migrate the data to a new pool, or compile the fix.
2026 Jan 11	5:00 AM PST	Client agrees to a compiled fix and the build team begins a build.
2026 Jan 11	7:34 AM PST	Build is delivered to the client.
2026 Jan 11	9:14 AM PST	All OSDs in the client's cluster are now up. PGs are coming back.
2026 Jan 11	9:42 AM PST	Post-incident observation: client reports a high `OSD_SCRUB_ERRORS` count (approximately 592,000 errors) following recovery.
2026 Jan 12	3:43 AM PST	Cluster reports `HEALTH_WARN` due to `too many repaired reads on 32 OSDs`. Whenever an OSD undergoes a scrub, the following errors are logged: `osd.20 [ERR] 4.115 shard 4(1) soid ... candidate size 1089536 info size 1085440 mismatch` `osd.20 [ERR] 4.115s0 shard 4(1) soid ... size 1089536 != size 1085440 from auth oi`
2026 Jan 12	6:43 AM PST	Clyso identifies two missing patches that are thought to contain the fix.
2026 Jan 12	9:36 AM PST	The build is produced and sent to the client.
2026 Jan 12	11:52 AM PST	The client reports that OSDs are not triggering any new errors. Write IO is working again.
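
The elapsed times implied by the timeline can be worked out with a few lines of Python. The timestamps below are copied from the table. The event labels and the helper `elapsed` are names chosen for this sketch.

```python
from datetime import datetime

# Timestamps copied from the timeline (all PST, so no timezone math is needed).
events = {
    "first_contact": datetime(2026, 1, 11, 2, 48),
    "first_build_delivered": datetime(2026, 1, 11, 7, 34),
    "all_osds_up": datetime(2026, 1, 11, 9, 14),
    "second_build_delivered": datetime(2026, 1, 12, 9, 36),
    "client_confirms_fix": datetime(2026, 1, 12, 11, 52),
}


def elapsed(start: str, end: str) -> str:
    """Format the time between two named events as hours and minutes."""
    delta = events[end] - events[start]
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


for name in list(events)[1:]:
    print(f"{name}: {elapsed('first_contact', name)} after first contact")
```

By this arithmetic, all OSDs were back up 6h26m after first contact, and the client confirmed the scrub fix 33h04m after first contact.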
