CephFS

CES and Ceph KB articles related to CephFS, covering the MDS.


MDS - Prevent MDS out-of-memory shortly after restart

Problem

Customer reports that the MDS is using much more memory than configured, and occasionally even goes OOM, causing a service disruption.

Solution

Increase the mds_cache_trim_threshold option from a default 64k to 512k:

# ceph config set mds mds_cache_trim_threshold 524288

Discussion

The MDS maintains its LRU cache size by periodically trimming entries, removing up to mds_cache_trim_threshold entries per tick. With the default of 64k entries per tick, a single highly active client can easily force the cache to grow more quickly than it can be trimmed. Increasing this option to 512k makes the MDS trim the LRU more aggressively, keeping the cache size under the configured limit.
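To see whether the cache is actually staying within its limit, the configured values and current cache usage can be compared; a minimal check, assuming the active MDS daemon is mds.0 (substitute your MDS name):

# ceph config get mds mds_cache_memory_limit
# ceph config get mds mds_cache_trim_threshold
# ceph tell mds.0 cache status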


MDS - CephFS Clean Power Off Procedure

Problem

A customer would like to cleanly power off their CephFS cluster before a power or network intervention.

Solution

Use the following steps to cleanly unmount and switch off a CephFS cluster:

  1. If possible, the customer should umount CephFS from all clients, so that all dirty pages are flushed.
  2. Prepare the ceph cluster:
# ceph osd set noout
# ceph osd set noin
  3. Wait until there is zero IO on the cluster, and notify any leftover clients that they need to umount.
  4. Mark the CephFS down with:
# ceph fs set cephfs down true # "cephfs" is the name of the filesystem
  5. Stop all the ceph-osd daemons. (It is okay to skip this step if the servers will be cleanly powered off.)
  6. Power off the servers.
  7. Power on the cluster.
  8. Wait for the OSDs and MDSs to boot and for all PGs to become active.
  9. Mark the CephFS back online:
# ceph fs set cephfs down false
  10. Reconnect and test clients.
  11. Remove the flags set in step 2, then verify cluster health as sketched after this list:
# ceph osd unset noout
# ceph osd unset noin
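Once the flags are removed, overall cluster and filesystem health can be spot-checked before handing the system back to users; a minimal sketch, assuming the filesystem is named "cephfs" as above:

# ceph -s
# ceph fs status cephfs # the filesystem should be active again
# ceph osd dump | grep flags # noout/noin should no longer be listed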

MDS - Unable to decode FSMap during Pacific Upgrade

Problem

While upgrading from Nautilus (v14) to Pacific (v16), a user reports that the new v16 mon will not start, crashing with this error:

unable to decode FSMap: void FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer understand old encoding version v < 7: Malformed input

The cluster does not currently have any MDS running or CephFS configured, but may have had a CephFS configured in the past.

Solution

The mon database likely contains old, incompatible fsmap data which is not readable by v16 ceph-mon daemons. This data must be cleaned up prior to upgrading to Pacific.

  1. First, downgrade the crashing mon back to Nautilus, and confirm that all MONs are running v14 and have quorum.
  2. Use these commands to temporarily create and then remove a CephFS:
# ceph osd pool create data 32 replicated
# ceph osd pool create meta 32 replicated
# ceph fs new cephfs meta data
# ceph fs fail cephfs
# ceph fs rm cephfs --yes-i-really-mean-it
# ceph config set global mon_allow_pool_delete true
# ceph osd pool rm data data --yes-i-really-really-mean-it
# ceph osd pool rm meta meta --yes-i-really-really-mean-it
# ceph config set global mon_allow_pool_delete false
  3. Next, trim out the old incompatible fsmap objects as follows (a verification sketch follows these commands):
# ceph fs dump
e4 <--- use whatever epoch number you get here (X)
...
# echo epoch is 4
epoch is 4
# ceph config set mon mon_mds_force_trim_to 3 # X-1, i.e. one less than the epoch above
# ceph config set mon paxos_service_trim_min 1
# ceph fs dump 2 # repeat until you can verify that epoch 2 can no longer be accessed
Error ENOENT: <---- epoch 2 has been trimmed and hence is no longer reachable
# ceph config rm mon mon_mds_force_trim_to
# ceph config rm mon paxos_service_trim_min
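Before retrying the Pacific upgrade, it is worth confirming that the temporary overrides are gone and that the FSMap still dumps cleanly; a minimal check:

# ceph config dump | grep -e mon_mds_force_trim_to -e paxos_service_trim_min # should return nothing
# ceph fs dump # the current epoch should still dump without errors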

MDS - MDS_CLIENT_RECALL, MDS_SLOW_REQUEST, MDS_CLIENT_OLDEST_TID Warnings Lasting for Many Hours

Problem

In some CephFS environments, a particular sequence of client operations may lead to a deadlocked client session and several long-lasting warnings on the MDS. Related warnings include MDS_SLOW_REQUEST, MDS_CLIENT_OLDEST_TID, and MDS_CLIENT_RECALL. ceph health detail is normally used to view the exact warnings and relevant clients.

Solution

While the exact root cause of the deadlock is not yet understood, these deadlocks may be resolved by cleanly unmounting and re-mounting CephFS on the relevant client. If this is not possible, then you may evict the relevant client.

  1. First it is important to understand which client is causing the deadlocked operations. Normally the relevant client id is displayed in ceph health detail, and the id can be confirmed by checking the outstanding ops on the relevant MDS. For example, if the health warning is generated by mds.0, use:
# ceph tell mds.0 ops | less

This will output a JSON structure with the oldest client operation shown first. Confirm that the age of the oldest operation is many hours. Note down the id of the relevant session, e.g. if client.12345678 then the id is 12345678.
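If jq is available, the oldest operation can be pulled out directly rather than paging through the full JSON; a minimal sketch, assuming the warning points at mds.0 (field names may vary slightly between releases):

# ceph tell mds.0 ops | jq '.ops[0] | {description, age}'

The session id (e.g. client.12345678) appears inside the description field.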

  2. Next, you can view the details of the client session as follows:
# ceph tell mds.* client ls id=<id, e.g. 12345678>

Details such as hostname and mount_path can be used to debug further on the client side.

  3. Umount and remount CephFS on the client side.

  4. If needed, evict the client session as follows; after the eviction, confirm that the warnings clear (see the sketch at the end of this article). First, ensure that clients are not blocklisted when evicted:

# ceph config set mds mds_session_blocklist_on_evict false
# ceph config set mds mds_session_blocklist_on_timeout false

Then evict the relevant client:

# ceph tell mds.* client evict id=<id, e.g. 12345678>
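After the eviction, confirm that the session is gone and that the warnings clear; a minimal check, reusing the id noted earlier:

# ceph tell mds.* client ls id=<id, e.g. 12345678> # should no longer return the session
# ceph health detail # the MDS warnings should clear shortly afterwards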

MDS - When and How to Enable Client Auto Eviction?

Problem

In some customer environments, an unknown workload is regularly causing MDS_CLIENT_RECALL, MDS_SLOW_REQUEST, MDS_CLIENT_OLDEST_TID warnings. Operators are using the client "eviction" procedure above as a workaround and would like some automation.

Solution

Automatic client eviction should only be used sparingly, after the following conditions have been satisfied:

  1. The CephFS cluster is seeing MDS_CLIENT_RECALL warnings lasting many hours, with MDS_SLOW_REQUEST ops also lasting many hours.
  2. Manual client eviction is confirmed to resolve the MDS_SLOW_REQUEST warnings fully.
  3. Manual client eviction is confirmed with the client/user to not have an adverse impact on their workload or data consistency.

If all of the above are true, then you may configure automatic client eviction, e.g. to evict sessions whose capability revocations have been blocked for 15 minutes (900 seconds):

# ceph config set mds mds_session_blocklist_on_evict false
# ceph config set mds mds_session_blocklist_on_timeout false
# ceph config set mds mds_cap_revoke_eviction_timeout 900
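The settings can be read back to confirm they are active; a minimal check:

# ceph config get mds mds_session_blocklist_on_evict
# ceph config get mds mds_cap_revoke_eviction_timeout

Eviction events are typically also visible in the cluster log (e.g. via ceph -w) and in the MDS log, which makes it possible to review how often the automation actually triggers.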

CephFS Pool Data Usage Growth Without Explanation

Problem

The CephFS data pool usage is increasing, even though users are deleting their CephFS files.

Solution

Deleted files are added to a Purge Queue, which is processed sequentially. If users delete files more quickly than the purge queue can be processed, the data pool usage will increase over time.

Internally the MDS has a few options to throttle the processing of the purge queue:

  • mds_max_purge_ops (default 8192)
  • mds_max_purge_ops_per_pg (default 0.5)
  • filer_max_purge_ops (default 10)

The defaults for the mds_max_purge_ops related options are normally good. The default filer_max_purge_ops (10) is too small for CephFS file systems holding very large files (e.g. 1TB+).

Increase filer_max_purge_ops to 40 so that space can be freed up more quickly.
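Using the same config-set convention as in the other articles above, for example:

# ceph config set mds filer_max_purge_ops 40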

Discussion

Internally, the MDS records the status of the Purge Queue in perf counters, which can be queried with perf dump.
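For example, assuming the active MDS is mds.0 and jq is available to select the purge_queue section (the exact section name and invocation may vary between releases):

# ceph tell mds.0 perf dump | jq .purge_queue

The counters of interest look like this: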

{
    "pq_executing_ops": 44814,
    "pq_executing_ops_high_water": 524321,
    "pq_executing": 1,
    "pq_executing_high_water": 64,
    "pq_executed": 93799,
    "pq_item_in_journal": 40967
}

After setting filer_max_purge_ops to 40, the Purge Queue clears out:

{
    "pq_executing_ops": 0,
    "pq_executing_ops_high_water": 524321,
    "pq_executing": 0,
    "pq_executing_high_water": 64,
    "pq_executed": 133469,
    "pq_item_in_journal": 0
}

In the above example, there are 40967 files waiting to be removed (pq_item_in_journal), 1 file is currently being removed (pq_executing), and that file still has 44814 RADOS objects to be removed (pq_executing_ops). With the default configuration, only 10 objects are removed at a time, so pq_item_in_journal will continue to grow, leading to unbounded space usage.
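To confirm that the queue is actually draining after the change, the pq_item_in_journal counter can be watched over time; a minimal sketch, again assuming mds.0 and jq:

# watch -n 60 'ceph tell mds.0 perf dump | jq .purge_queue.pq_item_in_journal'

The value should trend steadily towards zero once filer_max_purge_ops has been increased.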