CephFS: No space left on device
Problem
The error message No space left on device appears.
- Attempt to remove a file called test1:
  [root@mds ~]# rm test1
  rm: remove regular empty file ‘test1’? y
  rm: cannot remove ‘test1’: No space left on device
- Run the command df -h, which shows the actual filesystem consumption:
  [root@mds ~]# df -h /cephfs/
  Filesystem      Size  Used Avail Use% Mounted on
  ceph-fuse       115T   87T   28T  76% /cephfs
Explanation
The mds_bal_fragment_size_max option sets a hard limit on the size of a
directory fragment. When a fragment reaches this limit and CephFS tries to add
a new entry to it, the operation fails with a No space left on device message.
This is rarely observed for ordinary directories, because fragment size is
governed by the mds_bal_split_size soft limit, which can still be exceeded when
splitting is delayed. Because the soft limit is, by default, ten times smaller
than the hard limit mds_bal_fragment_size_max, the hard limit is unlikely to be
reached, although it remains possible and should be checked as well. Note that
in this case the error appears when adding a file, not when removing one.
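To see how far apart the two limits are on a given cluster, both values can be
read from the configuration database. The following is a minimal sketch, not a
supported tool: it assumes the ceph CLI with the centralized "ceph config"
interface is available, and the function names limit_ratio and
show_fragment_limits are made up for illustration.

```shell
# Hypothetical helper: prints how many times larger the hard limit ($2)
# is than the soft limit ($1), using integer division.
limit_ratio() {
    echo $(( $2 / $1 ))
}

# Hypothetical helper: reads both limits from the cluster configuration
# and prints them together with their ratio.
# Assumption: values overridden locally (for example in ceph.conf) are not
# reflected by "ceph config get".
show_fragment_limits() {
    split_size="$(ceph config get mds mds_bal_split_size)"
    fragment_max="$(ceph config get mds mds_bal_fragment_size_max)"
    echo "soft limit (mds_bal_split_size):        ${split_size}"
    echo "hard limit (mds_bal_fragment_size_max): ${fragment_max}"
    echo "hard/soft ratio:                        $(limit_ratio "$split_size" "$fragment_max")"
}
```

Calling show_fragment_limits on a node with an admin keyring prints the three
values. A ratio well below 10 suggests that one of the limits was changed from
its default and that an ordinary directory could hit the hard limit more easily.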
A more likely scenario is when mds_bal_fragment_size_max is reached for a
special "stray" directory. This directory keeps entries of deleted (unlinked)
files. When a file is deleted, its entry is unlinked from its directory and
added to the stray directory. Later, when data for this file is deleted, the
entry is removed from the stray directory. This removal can be delayed,
however, if the file is still in use (for example, it is still open on a client
or remains in the MDS cache) or if many files were deleted in bulk, which takes
some time to process. The stray directory can therefore grow (its size
correlates with having a large MDS cache), and because it never splits
dynamically, it may reach the mds_bal_fragment_size_max limit. In that case, a
file removal fails with the No space left on device error when CephFS tries to
add the entry being removed to the stray directory.
This case can be confirmed by checking the MDS num_strays counter, for
example by running the command ceph tell mds.X perf dump | grep strays. If
num_strays is close to 10 times the value of mds_bal_fragment_size_max (each
MDS rank spreads its strays across ten stray directories, and the limit
applies to each of them), the limit has been reached.
You can try to decrease the number of strays by forcing the MDS to free deleted
inodes from its cache. For example, ls -R /cephfs might help, as might running
the mds scrub command or temporarily reducing the MDS cache size. If this does
not help, or helps only temporarily, and you need a permanent solution, the
workaround is to increase the hard limit mds_bal_fragment_size_max. Be careful
when increasing this value, because it may lead to oversized directory fragment
objects in the metadata pool, which the OSDs may not be able to handle. We
therefore recommend increasing it gradually: double it (2x) as a first step,
check whether that is enough (num_strays no longer reaches the new limit), and
increase it further only if the previous step did not fix the problem.
The default value of mds_bal_fragment_size_max is 100,000, meaning the error
will occur when num_strays approaches 1,000,000. If mds_cache_memory_limit
has been increased, mds_bal_fragment_size_max should be increased accordingly
to raise this threshold.
For more on mds_bal_fragment_size_max, see the "Size Thresholds" section of the "Configuring Directory Fragmentation" page of the upstream Ceph documentation.
Diagnosing the issue
Run the following commands on the active MDS node to check the relevant values:
- Check the current mds_bal_fragment_size_max value:
  ceph tell mds.{id} config get mds_bal_fragment_size_max
- Check the current num_strays value:
  ceph tell mds.{id} perf dump | grep strays
If num_strays is approaching 10x the value of mds_bal_fragment_size_max, the
issue is likely to occur or is already occurring. For example, with the default
mds_bal_fragment_size_max of 100,000, a num_strays value approaching 1,000,000
indicates the threshold is being reached.
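The comparison described above can be scripted. The following is a minimal
sketch under stated assumptions: the perf dump output is JSON containing a
"num_strays" key, the hard limit is read with "ceph config get" (which does not
show values overridden locally on the daemon), and the function names
extract_num_strays, stray_usage_percent, and check_strays are hypothetical.

```shell
# Extracts the num_strays counter from "perf dump" JSON read on stdin.
# The closing quote in the pattern keeps keys such as num_strays_delayed
# from matching.
extract_num_strays() {
    grep -o '"num_strays": *[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Prints num_strays ($1) as an integer percentage of the effective limit,
# taken here as 10 x mds_bal_fragment_size_max ($2).
stray_usage_percent() {
    echo $(( $1 * 100 / ($2 * 10) ))
}

# Queries one MDS (argument: the MDS id) and reports how close it is to
# the limit. Requires the ceph CLI and access to the cluster.
check_strays() {
    num_strays="$(ceph tell "mds.$1" perf dump | extract_num_strays)"
    frag_max="$(ceph config get mds mds_bal_fragment_size_max)"
    if [ -z "$num_strays" ] || [ -z "$frag_max" ]; then
        echo "could not read num_strays or mds_bal_fragment_size_max" >&2
        return 1
    fi
    pct="$(stray_usage_percent "$num_strays" "$frag_max")"
    echo "num_strays=${num_strays} limit=$(( frag_max * 10 )) usage=${pct}%"
}
```

After sourcing these functions, check_strays {id} prints a single summary line
for the given MDS. A usage figure near 100% corresponds to the condition
described above.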
Solution
Double mds_bal_fragment_size_max to 200,000 on the active MDS node:
ceph config set mds mds_bal_fragment_size_max 200000
Monitor num_strays to ensure that it stays below 2,000,000 (10x the new
value). If it approaches the limit again, increase mds_bal_fragment_size_max
further in increments of 100,000. Be careful when increasing this value,
because it can lead to oversized directory fragment objects in the metadata
pool that the OSDs might not be able to handle.
Remember that if mds_cache_memory_limit has been increased,
mds_bal_fragment_size_max may need to be increased accordingly to raise this
threshold.
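Monitoring after the change can be automated with a simple polling loop. The
sketch below is illustrative only: the 80% warning threshold, the 60-second
default interval, and the function names strays_near_limit and watch_strays are
assumptions, not part of Ceph. It also assumes the limit can be read with
"ceph config get" and that the perf dump output contains a "num_strays" key.

```shell
# Succeeds when num_strays ($1) has reached threshold percent ($3) of the
# effective limit, taken here as 10 x mds_bal_fragment_size_max ($2).
strays_near_limit() {
    [ $(( $1 * 100 )) -ge $(( $2 * 10 * $3 )) ]
}

# Polls one MDS and prints a warning while num_strays is at or above 80%
# of the limit. Arguments: MDS id, polling interval in seconds (default 60).
watch_strays() {
    mds_id="$1"
    interval="${2:-60}"
    while true; do
        frag_max="$(ceph config get mds mds_bal_fragment_size_max)"
        num_strays="$(ceph tell "mds.${mds_id}" perf dump \
            | grep -o '"num_strays": *[0-9]*' | head -n 1 \
            | grep -o '[0-9][0-9]*$')"
        if [ -n "$num_strays" ] && [ -n "$frag_max" ] \
            && strays_near_limit "$num_strays" "$frag_max" 80; then
            echo "WARNING: num_strays=${num_strays} is at or above 80% of $(( frag_max * 10 ))"
        fi
        sleep "$interval"
    done
}
```

Running watch_strays {id} 300 would check the given MDS every five minutes. A
repeated warning is the signal to consider the next increase described above.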