Ceph Version: Luminous 12.2.2 with Filestore and XFS
After more than two years of operation, several OSDs on the production Ceph cluster reported the error message:
** ERROR: osd init failed: (28) No space left on device
and terminated. Attempts to restart the OSDs always ended with the same error message.
The Ceph cluster changed from HEALTH_OK to HEALTH_ERR with the warnings:
ceph osd near full
ceph pool near full
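Which OSDs and pools are behind these warnings can be seen with the usual status commands, for example:
ceph health detail      # names the nearfull/full OSDs and pools
ceph osd df tree        # per-OSD utilization and available space
ceph df                 # per-pool usage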
A superficial check with df -h showed between 71% and 89% used disk space, yet no new files could be created on the file system. Neither a remount nor an unmount followed by a mount changed the situation.
The first suspicion was that the inode64 mount option for XFS might be missing, but this option was set. We then took a closer look at the internal statistics of the XFS file system:
xfs_db -r -c "freesp -s" /dev/sdd1
df -h
df -i
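The same checks can be run for every Filestore OSD on a host with a small loop; this is only a sketch and assumes the default data directories under /var/lib/ceph/osd/:
for osd in /var/lib/ceph/osd/ceph-*; do
    dev=$(findmnt -n -o SOURCE --target "$osd")   # block device behind the OSD mount
    echo "== $osd ($dev) =="
    df -h "$osd"                                  # used disk space
    df -i "$osd"                                  # used inodes
    xfs_db -r -c "freesp -s" "$dev"               # free-space fragmentation summary
done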
Based on these checks, we chose the following solution:
First, so as not to fill the remaining OSDs any further, we stopped the recovery by preventing the failed OSDs from being marked out:
ceph osd set noout
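Whether the flag is active can be verified in the OSD map; Ceph also offers stricter flags to pause data movement, which were not used here and are listed only for completeness:
ceph osd dump | grep flags    # the flags line should now include noout
# optional, harder stops for data movement (not used in this case):
# ceph osd set norebalance
# ceph osd set nobackfill
# ceph osd set norecover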
We then redistributed the data across the remaining OSDs according to their utilization with
ceph osd reweight-by-utilization
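For reference, the command also has a dry-run variant, and the overload threshold (default 120) can be passed explicitly; the value 110 below is only an example:
ceph osd test-reweight-by-utilization      # dry run: shows which OSDs would be reweighted and by how much
ceph osd reweight-by-utilization 110       # reweight OSDs more than 10% above the average utilization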
Next, we moved a single PG (important: always a different PG per OSD) from each affected OSD's file system to /root to gain some free space, and started the OSDs again, roughly as sketched below.
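On a Filestore OSD the PGs are plain directories under current/ in the OSD data directory. The OSD id 12 and the PG 3.1a below are placeholders, and the chosen PG must still have healthy replicas on other OSDs:
systemctl stop ceph-osd@12        # the OSD is already down after the error; this is just to be safe
mv /var/lib/ceph/osd/ceph-12/current/3.1a_head /root/osd-12-pg-3.1a_head
systemctl start ceph-osd@12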
In the next step, we deleted virtual machine images that were no longer required from our cloud environment.
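Assuming the images are stored as RBD images, unused ones can also be listed and removed directly with the rbd CLI; the pool and image names below are placeholders:
rbd -p vms ls -l                 # list images in the (hypothetical) vms pool
rbd rm vms/obsolete-image        # delete a single unused image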
It took some time for the blocked requests to clear and the system to resume normal operation.
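The recovery progress and the remaining blocked requests can be followed on the monitor, and once the cluster is back to HEALTH_OK the flag set at the beginning should be removed again:
ceph -s                  # recovery progress and blocked requests
ceph health detail
ceph osd unset noout     # re-enable normal behaviour once the cluster is healthy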
Unfortunately, we were not able to determine the cause conclusively. However, as we are currently in the process of switching from Filestore to BlueStore, we will soon no longer need XFS anyway.