Restoring Connectivity between OSDs and Clients
Problem
Clients cannot connect to some OSDs, but connections to other OSDs and Monitors
are okay. The cluster seems functional and no health warnings are reported.
Many requests from clients succeed, but sometimes requests get stuck when
accessing the problematic OSDs. A typical case is that of RBD mounts: the
mount command succeeds, but further IO requests get stuck.
Such cases have been observed after an OSD host was rebooted: after the reboot, some OSDs were bound to incorrect interfaces or incorrect IP addresses.
Confirming the Case
One way to diagnose this issue is to run an RBD command that tries to access
all OSDs, and to add the --debug-ms=1 option so that RBD produces the output
necessary for debugging. Check the debug log to see if requests get stuck, and
if they do, note the OSDs that have those stuck requests. Run a command of the
following form to generate such a debug log:
rbd export {pool}/{image} /tmp/somefile --debug-ms=1
or
rbd bench --io-type read {pool}/{image} --debug-ms=1
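Reading the debug log by eye can be tedious, so the following is a minimal sketch of one way to list requests that never received a reply. It assumes that the debug output was saved to a file (for example by redirecting the command's stderr) and that sent requests appear as lines containing "--> <destination> -- osd_op(<client>:<tid> ..." while replies appear as lines containing "osd_op_reply(<tid> ...". The function name list_unanswered_osd_ops is hypothetical and the exact log layout may differ between releases, so treat the patterns as a starting point.

```shell
# list_unanswered_osd_ops LOGFILE   (hypothetical helper, not part of Ceph)
# Prints "tid destination" for every osd_op that was sent but has no
# matching osd_op_reply in LOGFILE. The assumed line layout is described
# in the text above and is not guaranteed for every Ceph release.
list_unanswered_osd_ops() {
    awk '
        / --> .*osd_op\(/ {
            # The destination address is the field that follows "-->".
            dest = ""
            for (i = 1; i < NF; i++)
                if ($i == "-->") { dest = $(i + 1); break }
            # The tid is the number after the last ":" of the request id.
            s = $0
            sub(/.*osd_op\(/, "", s)    # drop everything up to "osd_op("
            sub(/ .*/, "", s)           # keep only the request id
            sub(/.*:/, "", s)           # keep only the tid
            if (!(s in sent)) order[++n] = s
            sent[s] = dest
        }
        /osd_op_reply\(/ {
            # The tid is the first token inside "osd_op_reply(".
            s = $0
            sub(/.*osd_op_reply\(/, "", s)
            sub(/ .*/, "", s)
            replied[s] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in replied)) print order[i], sent[order[i]]
        }
    ' "$1"
}

# Example use (the log path is an assumption):
#   rbd export {pool}/{image} /tmp/somefile --debug-ms=1 2> /tmp/rbd-debug.log
#   list_unanswered_osd_ops /tmp/rbd-debug.log
```

The destination addresses printed by the sketch can then be matched against the addresses shown by ceph osd dump to find the OSD IDs involved.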
The problematic OSDs might also be identified by reviewing the output of the following command:
ceph osd dump
Check the output of this command to determine if it contains incorrect IP addresses.
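As an illustration, the check for incorrect addresses could be scripted roughly as follows. This sketch assumes IPv4 and assumes that each OSD line of the dump starts with "osd.N" and lists its public address vector first, in the form "[v2:ip:port/nonce,v1:ip:port/nonce]". The function name osd_public_addrs_outside and the example network are hypothetical.

```shell
# osd_public_addrs_outside CIDR   (hypothetical helper, not part of Ceph)
# Reads "ceph osd dump" text on stdin and prints "osd.N address" for each
# OSD whose first (assumed public) address lies outside CIDR. IPv4 only.
osd_public_addrs_outside() {
    awk -v cidr="$1" '
        # Convert a dotted-quad IPv4 address to a number.
        function ip2int(ip,    p) {
            split(ip, p, ".")
            return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
        }
        BEGIN {
            split(cidr, c, "/")
            size = 2 ^ (32 - c[2])            # number of addresses in the network
            net = int(ip2int(c[1]) / size)    # network number of the CIDR
        }
        /^osd\.[0-9]+ / {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^\[v[12]:/) {
                    addr = $i
                    sub(/^\[v[12]:/, "", addr)   # strip the "[v2:" prefix
                    sub(/:.*/, "", addr)         # strip the port and the rest
                    if (int(ip2int(addr) / size) != net) print $1, addr
                    break                        # look at the first vector only
                }
            }
        }
    '
}

# Example use (the network is an assumption; substitute your public_network):
#   ceph osd dump | osd_public_addrs_outside 192.168.0.0/24
```

Any OSD printed by the sketch has registered a public address outside the given network and is a candidate for the checks in the Solution section.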
Solution
- Check that the public_network and cluster_network settings are correct and that they are set at the global level (this means that the correct settings are applied to all daemons and clients).
- Check the network configuration on the OSD hosts: addresses in both the public_network and the cluster_network must be present and must have the correct mask. Use the ip address list command to check this.
- Check that the problematic OSDs are bound to the right addresses. Use the commands ceph osd dump and ceph osd metadata, and Linux network tools such as netstat and ss.
- Restart any problematic OSDs. Use the following command or, if you are not using cephadm to manage your cluster, a similar command that restarts the targeted OSDs:
  ceph orch daemon restart osd.<osd-id>
- If restarting the problematic OSDs does not help, further troubleshooting is needed.
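The first and third checks above can be combined into a small helper, sketched here under some assumptions: the network settings are stored in the cluster's central configuration database (values set only in a local ceph.conf would not show up), and ceph osd metadata prints one JSON key per line, including front_addr and back_addr. The function name show_osd_network_view is hypothetical.

```shell
# show_osd_network_view OSD_ID   (hypothetical helper, not part of Ceph)
# Prints the configured networks next to the addresses that the given OSD
# has registered, so that a mismatch is easy to spot.
show_osd_network_view() {
    osd_id="$1"
    echo "== configured networks =="
    echo "public_network:  $(ceph config get osd public_network)"
    echo "cluster_network: $(ceph config get osd cluster_network)"
    echo "== addresses registered for osd.${osd_id} =="
    # front_addr is the public-side address, back_addr the cluster-side one.
    ceph osd metadata "${osd_id}" | grep -E '"(front|back)_addr"'
}

# On the OSD host itself, the interfaces and listening sockets can be
# compared against the values printed above:
#   ip address list
#   ss -tlnp | grep ceph-osd
```

If front_addr or back_addr falls outside the corresponding configured network, the OSD is probably bound to the wrong interface and is a candidate for the restart step.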