CephFS MDS full, cluster now read-only
Problem
The CephFS metadata pool ballooned from 50GB to 4.7TB within eight hours, filling up the SSD OSDs backing it and leaving the cluster read-only.
Solution
- Stop all active MDSs.
- Stop the client holding the most 'reccaps' (how to identify it is shown in the Discussion).
- Delete the test pool to reclaim space.
- Restart the 3 MDSs (a command sketch of these steps follows below).
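A minimal command sketch of the recovery, assuming systemd-managed daemons; the mountpoint /mnt/cephfs and the pool name test_pool are placeholders, not values from the affected cluster.
Run on each MDS host:
# systemctl stop ceph-mds.target
Run on the offending client node (stop its workload first):
# umount -f /mnt/cephfs
Run from an admin node:
# ceph config set mon mon_allow_pool_delete true
# ceph osd pool delete test_pool test_pool --yes-i-really-really-mean-it
Once the space has been reclaimed, run on each MDS host:
# systemctl start ceph-mds.target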
Discussion
In this case, the affected cluster had 3 active MDSs (ranks 0, 1, 2). The logs showed many 'Updating MDS map to version XXXXXX' messages: the MDS map version kept climbing as slow requests piled up while the OSDs filled. Although the OSDs were full, two recovery options remained: (1) delete a test pool and free up 141GB, or (2) move the CephFS metadata pool to other OSDs (sketched below).
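For option (2), a hedged sketch: create a CRUSH rule that targets OSDs with free capacity and reassign the metadata pool to it. The pool name cephfs_metadata, rule name meta_on_hdd, and device class hdd are assumptions; check free capacity first.
# ceph osd df
# ceph osd crush rule create-replicated meta_on_hdd default host hdd
# ceph osd pool set cephfs_metadata crush_rule meta_on_hdd
Changing crush_rule triggers data movement, so only do this if the target OSDs have enough free space for the inflated metadata pool.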
To find the client with the most "reccaps", run the following commands:
# ceph tell mds.x session ls > session_list.json
# ./clyso-cephfs-session-top -f session_list.json
Client Sessions: 846
LOADAVG  NUMCAPS  RECCAPS  RELCAPS  LIVENESS  CAPACQU  CLIENT
...
0        4        0        0        0         0        909346842 name_of_client_1
0        5680     481429   0        0         0        909452623 name_of_client_2
0        8        0        0        0         0        909452626 name_of_client_3
...
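If the clyso-cephfs-session-top helper is not available, a rough equivalent can be assembled with jq. This is a sketch: the field names (recall_caps.value, num_caps, client_metadata.hostname) match the session ls JSON of recent Ceph releases but should be verified against your own output.
# jq -r 'sort_by(-.recall_caps.value) | .[0:10][] | [.id, .num_caps, .recall_caps.value, .client_metadata.hostname] | @tsv' session_list.json
This prints the ten sessions with the highest cap-recall pressure, with session id, cap count, and hostname, which is enough to pick the client to stop.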