MDS Client Warnings and Session Deadlocks
Problem
In some CephFS environments, a particular sequence of client operations may lead to a deadlocked client session and several long-lasting warnings on the MDS. Related warnings include MDS_SLOW_REQUEST, MDS_CLIENT_OLDEST_TID, and MDS_CLIENT_RECALL. Run ceph health detail to view the exact warnings and the relevant clients.
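When the cluster reports many unrelated health items, it can help to narrow the output to the three warnings named above. The following is a minimal sketch, not an official tool: filter_mds_warnings is a hypothetical helper name, and it assumes the warning codes appear literally in the text output. It keeps only lines that contain one of the codes, so any per-client detail lines that do not repeat the code are dropped; run the full command to see those.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): read health output on stdin and
# keep only the lines that mention one of the three MDS warnings above.
# Exits non-zero when no such line is found, which is handy in scripts.
filter_mds_warnings() {
    grep -E 'MDS_SLOW_REQUEST|MDS_CLIENT_OLDEST_TID|MDS_CLIENT_RECALL'
}

# Usage (requires a working ceph CLI and admin keyring):
#   ceph health detail | filter_mds_warnings
```

Because the function only reads standard input, it can also be applied to saved health output from an earlier incident.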
Solution
While the exact root cause of the deadlock is not yet understood, these deadlocks may be resolved by cleanly unmounting and re-mounting CephFS on the relevant client. If this is not possible, you may evict the relevant client.
- First, identify which client is causing the deadlocked operations. Normally the relevant client id is displayed in ceph health detail, and the id can be confirmed by checking the outstanding ops on the relevant MDS. For example, if the health warning is generated by mds.0, use:
# ceph tell mds.0 ops | less
This will output a JSON structure with the oldest client operation shown first. Confirm that the age of the oldest operation is many hours. Note down the id of the relevant session, e.g. if the client is client.12345678 then the id is 12345678.
- Next, you can view the details of the client session as follows:
# ceph tell mds.* client ls id=<id, e.g. 12345678>
Details such as hostname and mount_path can be used to debug further on the client side.
- Unmount and remount CephFS on the client side.
- If needed, evict the client session as follows. First, ensure that clients are not blocklisted when evicted:
# ceph config set mds mds_session_blocklist_on_evict false
# ceph config set mds mds_session_blocklist_on_timeout false
Then evict the relevant client:
# ceph tell mds.* client evict id=<id, e.g. 12345678>
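The session lookup and eviction steps above can be wrapped in small shell functions so that the client id is validated before any command runs. This is a sketch under stated assumptions: client_id, show_session, and evict_session are hypothetical helper names, and the CEPH variable is an illustrative hook that lets the commands be previewed (for example with CEPH=echo) before running them against a live cluster. The ceph commands themselves are the ones shown in the steps above.

```shell
#!/bin/sh
# CEPH is an illustrative override hook (assumption): set CEPH=echo to
# print the arguments instead of executing the real ceph CLI.
CEPH=${CEPH:-ceph}

# Hypothetical helper: accept "client.12345678" or "12345678" and print
# the bare numeric id; fail on anything that is not purely numeric.
client_id() {
    id=${1#client.}
    case $id in
        ''|*[!0-9]*)
            echo "not a numeric client id: $1" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$id"
}

# Show session details (hostname, mount_path, ...) for one client.
# 'mds.*' is quoted so the shell does not expand it as a file glob.
show_session() {
    sid=$(client_id "$1") || return 1
    $CEPH tell 'mds.*' client ls id="$sid"
}

# Disable blocklisting on evict/timeout, then evict the client session.
evict_session() {
    sid=$(client_id "$1") || return 1
    $CEPH config set mds mds_session_blocklist_on_evict false
    $CEPH config set mds mds_session_blocklist_on_timeout false
    $CEPH tell 'mds.*' client evict id="$sid"
}
```

For example, after sourcing the functions, CEPH=echo followed by evict_session client.12345678 prints the three command lines without contacting the cluster; unsetting CEPH (or setting it back to ceph) runs them for real.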