Stale watchers prevent mapping RBD images with ceph-csi-rbd

Symptoms

  • Pods with RBD images cannot start because the images are reported to be still in use, even though there are no other apparent clients.
  • The watchers list of the image includes clients that no longer exist (e.g. the host is down).
  • These clients remain in the watchers list even after being blocklisted.
# rbd status k8s/csi-vol-b0d7e424-cdda-41a5-950e-e5ff16dc0826
Watchers:
watcher=172.31.100.35:0/632487163 client.1391016703 cookie=140137199855360
# ceph osd blocklist add 172.31.100.35:0/632487163
# rbd status k8s/csi-vol-b0d7e424-cdda-41a5-950e-e5ff16dc0826
Watchers:
watcher=172.31.100.35:0/632487163 client.1391016703 cookie=140137199855360

The host 172.31.100.35 is down and its client has been blocklisted, yet it still appears in the watchers list.
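
When several images have to be checked, the watcher addresses can be pulled out of the `rbd status` text output with a small filter. The following is only a sketch: the function name `list_watchers` is made up for illustration, and it assumes the plain-text output format shown above.

```shell
# Print the address of every watcher found in `rbd status` text output.
# Reads the output from stdin; prints nothing if there are no watchers.
list_watchers() {
    sed -n 's/^[[:space:]]*watcher=\([^[:space:]]*\).*/\1/p'
}

# Example using the output shown above. On a live cluster, pipe the
# output of `rbd status <pool>/<image>` into the function instead.
printf '%s\n' 'Watchers:' \
    'watcher=172.31.100.35:0/632487163 client.1391016703 cookie=140137199855360' \
    | list_watchers
```

Each printed address is in the form that `ceph osd blocklist add` accepts, as used in the transcript above.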

Problem

Affects:

  • Squid release ≤19.2.0
  • Reef release ≤18.2.4
  • Quincy release ≤17.2.7

The versions listed above are affected by a bug that causes clients that exited uncleanly to remain in the watchers list of an RBD image even after the watch timeout (osd_client_watch_timeout) has passed. This makes it impossible to map the image in a new Pod, because ceph-csi-rbd checks for active watchers before mapping. Blocklisting the client does not remove it from the watchers list either.

See the original bug report: https://tracker.ceph.com/issues/58120
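
To check a version string against the affected ranges listed above, a comparison along the following lines may help. This is a minimal sketch: `is_affected` is a hypothetical helper, it relies on `sort -V` (available in GNU coreutils), and it only knows the three releases named in this article.

```shell
# Exit status: 0 if the version is in an affected range listed above,
#              1 if it is newer than the last affected version of its release,
#              2 if the release is not covered by this article.
is_affected() {
    version=$1
    case ${version%%.*} in
        19) last_affected=19.2.0 ;;   # Squid
        18) last_affected=18.2.4 ;;   # Reef
        17) last_affected=17.2.7 ;;   # Quincy
        *)  return 2 ;;
    esac
    # sort -V orders version strings; if last_affected sorts highest,
    # then version <= last_affected.
    highest=$(printf '%s\n%s\n' "$version" "$last_affected" | sort -V | tail -n 1)
    [ "$highest" = "$last_affected" ]
}

# Usage: is_affected 18.2.4 && echo "affected"
```

The version of the running daemons can be read, for example, from the output of `ceph versions`. The OSD version is the relevant one here, since the workaround below operates on OSDs.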

Solution

Upgrade to a Ceph version in which this bug is fixed. If that is not possible, the workaround is to restart the relevant primary OSDs, which can be identified as follows:

# ceph osd map k8s csi-vol-b0d7e424-cdda-41a5-950e-e5ff16dc0826
osdmap e133 pool 'k8s' (9) object 'csi-vol-b0d7e424-cdda-41a5-950e-e5ff16dc0826' -> pg 9.de8252ef (9.f) -> up ([2,1,0], p2) acting ([2,1,0], p2)
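
If the lookup has to be repeated for many images, the acting primary can be extracted from this output with a short filter. The sketch below assumes the single-line text format shown above; `primary_osd` is a hypothetical helper name.

```shell
# Print the acting primary OSD id from `ceph osd map` text output (stdin).
# It matches the trailing "acting ([...], pN)" part and prints N.
primary_osd() {
    sed -n 's/.*acting (\[[0-9,]*], p\([0-9]*\)).*/\1/p'
}

# Usage on a live cluster:
#   ceph osd map <pool> <object> | primary_osd
```

The filter prints nothing if the line does not contain an acting set in the expected form, so an empty result should be treated as a failed lookup rather than as OSD 0.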

In the example output above, OSD 2 is the primary (p2). Restarting it can be done in several ways. The example below first checks that the OSD can be stopped safely and then marks it down, which forces it to re-peer:

# ceph osd ok-to-stop 2
# ceph osd down 2

After restarting the relevant OSDs, the stale watchers should be gone, and the Pod should start shortly afterwards.
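
To confirm that the workaround took effect, the watchers list can be polled until it is empty. This is only a sketch: `wait_for_watchers_gone` is a hypothetical helper, and the attempt count and delay are arbitrary values to adjust as needed.

```shell
# Poll a status command until its output no longer contains a "watcher=" entry.
# Usage: wait_for_watchers_gone <attempts> <delay_seconds> <command...>
# Returns 0 once no watcher is listed, 1 if watchers remain after all attempts.
wait_for_watchers_gone() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        # Only trust the output if the command itself succeeded.
        if out=$("$@"); then
            case $out in
                *watcher=*) ;;       # still watched, keep waiting
                *) return 0 ;;       # no watchers listed
            esac
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}

# Usage on a live cluster (30 attempts, 2 seconds apart):
#   wait_for_watchers_gone 30 2 rbd status k8s/csi-vol-b0d7e424-cdda-41a5-950e-e5ff16dc0826
```

Note that this waits for all watchers to disappear, so it is only meaningful while no Pod has the image mapped. A healthy client that maps the image also shows up as a watcher.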