MDS Client Warnings and Session Deadlocks

Problem

In some CephFS environments, a particular sequence of client operations can deadlock a client session and trigger several long-lasting warnings on the MDS. Related warnings include MDS_SLOW_REQUEST, MDS_CLIENT_OLDEST_TID, and MDS_CLIENT_RECALL. Run ceph health detail to view the exact warnings and the clients involved.

Solution

While the root cause of the deadlock is not yet understood, these deadlocks may be resolved by cleanly unmounting and remounting CephFS on the relevant client. If this is not possible, you can evict the relevant client instead.

  1. First, identify which client is causing the deadlocked operations. The relevant client id is normally displayed in ceph health detail, and it can be confirmed by checking the outstanding ops on the relevant MDS. For example, if the health warning is generated by mds.0, use:
# ceph tell mds.0 ops | less

This outputs a JSON structure with the oldest client operation shown first. Confirm that the age of the oldest operation is many hours. Note the id of the relevant session; for example, for client.12345678 the id is 12345678.
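The two manual checks above (deriving the numeric id from the session name and judging whether an operation is hours old) can be sketched as small shell helpers. This is only an illustration: the function names client_id_from_name and age_hours are hypothetical and not part of the ceph CLI, and the sketch assumes the operation age is reported in seconds, possibly with a fractional part.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not ceph commands.

# Strip the "client." prefix from a session name such as client.12345678.
client_id_from_name() {
    printf '%s\n' "${1#client.}"
}

# Convert an operation age in seconds to whole hours.
# Assumption: the age is given in seconds, e.g. 86400.5.
age_hours() {
    secs=${1%%.*}      # drop any fractional part
    secs=${secs:-0}    # treat an empty integer part (e.g. ".5") as 0
    echo $(( secs / 3600 ))
}
```

For example, client_id_from_name client.12345678 prints the id to pass to the later commands, and age_hours applied to the age of the oldest operation gives a quick indication of whether it has been stuck for many hours.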

  2. Next, view the details of the client session as follows:
# ceph tell mds.* client ls id=<id, e.g. 12345678>

Details such as hostname and mount_path can be used to debug further on the client side.
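To pull a single detail out of the session listing, a rough filter such as the one below may help. It is a minimal sketch, not a JSON parser: it assumes the output is pretty-printed with one "key": "value" pair per line and that the field holds a string value. The function name session_field is hypothetical; a dedicated JSON tool is more robust where one is available.

```shell
#!/bin/sh
# Hypothetical helper: print the string value(s) of one field from
# pretty-printed JSON read on stdin.
# Assumptions: one "key": "value" pair per line; the field name in $1
# contains no "/" or regex metacharacters.
session_field() {
    sed -n 's/^[[:space:]]*"'"$1"'":[[:space:]]*"\(.*\)",\{0,1\}[[:space:]]*$/\1/p'
}

# Usage on an admin node (not executed here):
#   ceph tell 'mds.*' client ls id=12345678 | session_field hostname
```

Because the filter is line-based, it prints one line per match, so a field reported by more than one MDS appears more than once.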

  3. Unmount and remount CephFS on the client side.

  4. If needed, evict the client session as follows. First, ensure that clients are not blocklisted when evicted:

# ceph config set mds mds_session_blocklist_on_evict false
# ceph config set mds mds_session_blocklist_on_timeout false

Then evict the relevant client:

# ceph tell mds.* client evict id=<id, e.g. 12345678>
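The eviction commands above can be wrapped in a small script so that the client id is validated before anything is changed. The following is a sketch under stated assumptions: run, evict_client, and the DRY_RUN variable are hypothetical names introduced here, and dry-run mode is the default so the commands are only printed unless DRY_RUN=0 is set. The mds.* target is quoted so the shell does not expand it against files in the current directory.

```shell
#!/bin/sh
# Hypothetical wrapper around the eviction commands shown above.
# run, evict_client and DRY_RUN are illustrative names, not ceph features.

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Evict one client session by numeric id without blocklisting it.
evict_client() {
    id=$1
    case $id in
        ''|*[!0-9]*)
            echo "client id must be numeric, got: '$id'" >&2
            return 1
            ;;
    esac
    run ceph config set mds mds_session_blocklist_on_evict false &&
    run ceph config set mds mds_session_blocklist_on_timeout false &&
    run ceph tell 'mds.*' client evict "id=$id"
}
```

Calling evict_client 12345678 with the default settings prints the three commands for review; running it again with DRY_RUN=0 executes them on a node with admin access to the cluster.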