Restoring Connectivity between OSDs and Clients

Problem

Clients cannot connect to some OSDs, but connections to other OSDs and Monitors are okay. The cluster seems functional and no health warnings are reported. Many requests from clients succeed, but sometimes requests get stuck when accessing the problematic OSDs. A typical case is that of RBD mounts: the mount command succeeds, but further IO requests get stuck.

Such cases have been observed after an OSD host was rebooted: after the reboot, some OSDs were bound to incorrect interfaces or incorrect IP addresses.

Confirming the Case

One way to diagnose this issue is to run an RBD command that accesses all OSDs, adding the --debug-ms=1 option so that the command produces the output necessary for debugging. Check the debug log to see whether requests get stuck, and if they do, note the OSDs that have those stuck requests. Run a command of the following form to generate such a debug log:

rbd export {pool}/{image} /tmp/somefile --debug-ms=1

or

rbd bench --io-type read {pool}/{image} --debug-ms=1
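
Because a stuck request can make these commands hang indefinitely, it can help to run them under a time limit and keep the debug output in a file for later review. The following is a minimal sketch, not part of Ceph: run_with_timeout, its parameters, and the log path are hypothetical names chosen for illustration. Both output streams are written to the log file, so the sketch does not depend on which stream the debug messages use.

```python
import subprocess


def run_with_timeout(cmd, log_path, timeout_s):
    """Run cmd, saving stdout and stderr to log_path.

    Returns True if the command finished within timeout_s seconds,
    or False if it was still running and had to be killed
    (which may indicate a stuck request).
    """
    with open(log_path, "wb") as log:
        try:
            subprocess.run(
                cmd,
                stdout=log,
                stderr=subprocess.STDOUT,  # merge both streams into the log
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception.
            return False
    return True
```

For example, run_with_timeout(["rbd", "bench", "--io-type", "read", "mypool/myimage", "--debug-ms=1"], "/tmp/rbd-debug.log", 120) would return False if the benchmark is still running after two minutes (mypool/myimage is a placeholder). The saved log can then be searched for the OSDs whose requests never completed.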

The problematic OSDs might also be identified by reviewing the output of the following command:

ceph osd dump

Check the output of this command to determine if it contains incorrect IP addresses.
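
On a cluster with many OSDs, this check can be automated. The sketch below assumes that the JSON form of the dump (for example, from ceph osd dump --format json) contains an "osds" list whose entries carry an "osd" ID and a "public_addr" string of the form ip:port/nonce; verify this against the output of your release before relying on it. The function name osds_outside_networks is hypothetical.

```python
import ipaddress
import json


def osds_outside_networks(osd_dump_json, networks):
    """Return (osd_id, address) pairs whose public address is not in any
    of the given networks (CIDR strings), or could not be parsed."""
    nets = [ipaddress.ip_network(n) for n in networks]
    suspects = []
    for osd in json.loads(osd_dump_json).get("osds", []):
        # Assumed field name and format: "public_addr": "10.0.0.11:6801/1234"
        raw = osd.get("public_addr", "")
        # Drop the "/nonce" suffix, then ":port", then any IPv6 brackets.
        host = raw.split("/")[0].rsplit(":", 1)[0].strip("[]")
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            suspects.append((osd.get("osd"), raw))
            continue
        if not any(ip in net for net in nets):
            suspects.append((osd.get("osd"), host))
    return suspects
```

Pass the networks configured as public_network; any OSD that is returned deserves a closer look.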

Solution

  1. Check that the public_network and cluster_network settings are correct and that they are set at the global level, so that the same settings apply to all daemons and clients.

  2. Check that the network configuration on the OSD hosts is correct: addresses in both the public_network and the cluster_network must be present, and they must have the correct mask. Use the ip address list command to check this.

  3. Check that the problematic OSDs are bound to the right addresses. Use the ceph osd dump and ceph osd metadata commands, together with Linux network tools such as netstat and ss.

  4. Restart any problematic OSDs. On a cluster managed by cephadm, use the following command; otherwise, use an equivalent command that restarts the targeted OSDs:

    ceph orch daemon restart osd.<osd-id>

  5. If restarting the problematic OSDs doesn't help, further troubleshooting is needed.
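
The host-side check in step 2 can be sketched as follows. The sketch assumes the JSON output of iproute2 (ip -j address list), in which each interface has an "ifname" and an "addr_info" list with "local" and "prefixlen" entries. The function name interfaces_on_network is hypothetical, and the network argument stands for the value configured as public_network or cluster_network.

```python
import ipaddress
import json


def interfaces_on_network(ip_json, network):
    """Return (ifname, address/prefix, mask_matches) for every host address
    that falls inside the given network (a CIDR string).

    mask_matches is True only when the prefix length configured on the
    interface yields exactly the given network.
    """
    net = ipaddress.ip_network(network)
    matches = []
    for link in json.loads(ip_json):
        for info in link.get("addr_info", []):
            try:
                iface = ipaddress.ip_interface(
                    f"{info['local']}/{info['prefixlen']}"
                )
            except (KeyError, ValueError):
                continue  # skip entries without a usable address
            if iface.ip in net:
                matches.append(
                    (link.get("ifname"), str(iface), iface.network == net)
                )
    return matches
```

An empty result means that the host has no address in that network. A result whose last element is False means that the address is present but its mask differs from the configured one.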