Ceph Time Synchronization with Chrony
Overview
Ceph clusters require precise time synchronization across all nodes, particularly Monitor (MON) nodes. Clock skew between monitors can prevent quorum formation, cause election storms, and lead to cluster instability. This guide explains how to configure Chrony for Ceph deployments.
Understanding NTP Stratum Architecture
Before diving into Ceph-specific configurations, it's important to understand how time synchronization works through the NTP stratum hierarchy. This hierarchical architecture ensures that accurate time flows from authoritative sources down through your network to every device.
What is Stratum?
Stratum is a measure of distance (in terms of synchronization hops) from the ultimate time source. The lower the stratum number, the closer the device is to an authoritative time reference.
Stratum Levels Explained
Stratum 0 - Reference Clocks
These are the most accurate timekeeping devices and represent the root of the entire time distribution tree:
- Atomic clocks: Cesium, rubidium, or hydrogen maser atomic clocks (accuracy: nanoseconds)
- GPS/GNSS receivers: Receive time signals from GPS, GLONASS, Galileo, or BeiDou satellites (accuracy: microseconds)
- Radio clocks: Receive time broadcasts from national time services like WWVB, DCF77, or MSF
Stratum 0 devices do not directly connect to computer networks. They provide time signals (typically via pulse-per-second or PPS) to Stratum 1 servers through dedicated hardware connections like serial ports.
Stratum 1 - Primary Time Servers
These are computers directly connected to Stratum 0 devices:
- Synchronized to within microseconds of UTC (Coordinated Universal Time)
- Act as the first network-accessible time sources
- Also called "primary time servers"
- Often peer with other Stratum 1 servers for sanity checking and backup
- Typical accuracy: 1-10 microseconds from UTC
Examples: time.nist.gov (NIST servers), GPS-equipped NTP servers in data centers
Stratum 2 - Secondary Time Servers
These are computers synchronized over a network to Stratum 1 servers:
- Query multiple Stratum 1 servers for redundancy
- Often peer with other Stratum 2 servers for stability
- Act as servers for Stratum 3 clients
- Typical accuracy: 1-10 milliseconds from UTC
- Most public NTP servers on the internet are Stratum 2
Examples: Public pool servers (pool.ntp.org), ISP-provided NTP servers
Stratum 3 through 15 - Client Servers
- Stratum 3 devices sync from Stratum 2 servers
- Each subsequent level adds one stratum number
- Most enterprise clients operate at Stratum 3 or 4
- Accuracy degrades slightly at each level due to network latency and jitter
- Typical accuracy at Stratum 3: 10-100 milliseconds from UTC
Stratum 16 - Unsynchronized
- Special value indicating a device is not synchronized
- Device has lost contact with all time sources
- Should never be used as a time source by other devices
- Chrony/NTP clients will reject Stratum 16 sources
How the Hierarchy Works
The NTP protocol uses this hierarchical structure to prevent timing loops and distribute load efficiently:
- Bellman-Ford algorithm: The NTP subnet self-organizes into a shortest-path spanning tree (using a variant of the Bellman-Ford algorithm) that minimizes synchronization distance to Stratum 1 servers
- Reference ID tracking: Each server knows which upstream server it's synchronized to, preventing circular dependencies
- Load distribution: Higher stratum servers reduce load on primary time sources
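To see this chain on a live system, Chrony can report both the server this host follows and what each configured source is itself synchronized to. A minimal check using standard chronyc subcommands (ntpdata is available in chrony 3.0 and later; run as root):
# Which source this host follows, and at what stratum
chronyc tracking | grep -E 'Reference ID|Stratum'
# Per-source detail, including each source's own reference ID and stratum
chronyc ntpdata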
Why Multiple Strata Instead of Everyone Using Stratum 1?
Load distribution: If every device on the internet queried Stratum 1 servers directly, these primary servers would be overwhelmed with requests and unable to function properly.
Efficient scaling: The hierarchical model allows thousands of Stratum 2 servers to serve millions of Stratum 3+ clients without overloading the limited number of Stratum 1 sources.
Network efficiency: Clients should use time sources close to them in network topology. A Stratum 3 server on your local network will give better accuracy than a distant Stratum 1 server due to lower network latency.
Cost considerations: Operating a Stratum 1 server requires expensive hardware (GPS receivers, atomic clocks) and is unnecessary for most use cases.
Stratum and Accuracy: Not Always Correlated
Important: Stratum number indicates distance from reference, not quality of time. A well-configured Stratum 3 server on your local network can provide more accurate time than a poorly-configured or distant Stratum 1 server.
Factors affecting accuracy regardless of stratum:
- Network latency: Variable delays between client and server
- Symmetric vs asymmetric paths: Different delays in each direction
- Server load: Overloaded servers respond inconsistently
- Clock stability: Quality of the local oscillator
- Peering relationships: Servers that peer can validate each other's time
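To judge sources by measured quality rather than by stratum alone, compare their offsets and jitter directly. A quick sketch using standard chronyc subcommands:
# Estimated offset, offset standard deviation (jitter), and frequency skew per source
chronyc sourcestats -v
# Reachability, stratum, and last measured offset per source
chronyc sources -v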
Stratum in Ceph Deployments
For a typical Ceph cluster:
Stratum 1: Public primary time servers (e.g., time.nist.gov or GPS-backed servers)
↓
Stratum 2: Your Ceph Monitor nodes (synced to Stratum 1 + peered)
↓
Stratum 3: Your Ceph OSD/MDS/RGW nodes (synced to MON nodes)
↓
Stratum 4: Other infrastructure (synced to Ceph nodes)
For Ceph specifically:
- Monitor nodes should be Stratum 2 or 3 (synced to external sources)
- OSD/MDS/RGW nodes will be one stratum higher than MON nodes
- The exact stratum number matters less than maintaining tight synchronization between monitors
- Sub-millisecond skew between monitors is readily achievable at Stratum 3 when syncing over a local network, far below Ceph's 50ms threshold
Verifying Your Stratum Level
Check your current stratum with Chrony:
# View stratum in tracking output
chronyc tracking | grep Stratum
# View stratum of your time sources
chronyc sources -v
Example output:
Reference ID : C0A80001 (ntp1.example.com)
Stratum : 3
This shows that the system is at Stratum 3, synchronized to a Stratum 2 source.
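For scripting, chronyc can also emit machine-readable output. A small sketch that assumes the CSV field order of recent chrony releases (stratum is the third field); verify against the human-readable output on your version:
# Print just the stratum number (CSV output: refid,name,stratum,...)
chronyc -c tracking | cut -d, -f3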
Why Ceph Demands Precise Time Synchronization
Ceph monitors use a Paxos-based consensus protocol and rely on closely synchronized clocks to maintain a stable quorum:
- Default tolerance: Ceph warns when clock skew exceeds 50ms (mon_clock_drift_allowed = 0.05)
- Monitor behavior: Monitors with excessive clock skew cannot reliably participate in quorum
- Critical operations affected: Monitor elections, client connections, and cluster state updates
- Check frequency: Ceph evaluates time synchronization every 5 minutes
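You can confirm both values on a running cluster; this assumes a release with the centralized ceph config store (Mimic or later):
# Allowed skew between monitors, in seconds (default 0.05)
ceph config get mon mon_clock_drift_allowed
# How often monitors compare clocks, in seconds (default 300)
ceph config get mon mon_timecheck_interval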
Clock skew symptoms:
- HEALTH_WARN clock skew detected messages
- Monitors stuck in probing, electing, or synchronizing states
- Monitors failing to join quorum
- Client authentication failures
Why Chrony for Ceph?
Chrony is recommended over legacy ntpd for Ceph clusters:
- Better accuracy: Achieves sub-millisecond synchronization (well below Ceph's 50ms threshold)
- Faster convergence: Synchronizes clocks more quickly after system boot or network outages
- Handles network issues better: More resilient to intermittent connectivity
- Smooth time adjustments: Uses clock slewing instead of sudden jumps (critical for Ceph)
Important: Ceph does not tolerate sudden time jumps. Never use ntpdate or similar tools that set time abruptly.
Ceph Cluster Architecture for Time Sync
Basic Setup (Most Common)
External NTP sources (pool.ntp.org, etc.)
|
v
Ceph MON nodes (peer with each other)
|
v
OSD/MDS/RGW nodes (sync from MONs)
Recommended Configuration
- All MON nodes:
  - Sync to multiple external NTP sources
  - Peer with each other (critical for Ceph)
  - Act as NTP servers for other cluster nodes
- OSD/MDS/RGW nodes:
  - Sync to all MON nodes
  - No need to peer with each other
- Network considerations:
  - Use a local/internal NTP server if available
  - Avoid a single network path to external sources
Chrony Configuration for Ceph
Monitor Node Configuration
Configure MON nodes to sync externally and peer with each other:
# /etc/chrony/chrony.conf - Ceph Monitor Node
# External time sources - use multiple for redundancy
pool pool.ntp.org iburst
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
# Peer with other Ceph monitor nodes (CRITICAL for Ceph)
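# Note: on each monitor, list only the other monitors - omit this node's own hostname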
peer mon1.cluster.local
peer mon2.cluster.local
peer mon3.cluster.local
# Allow other cluster nodes to query this server
allow 10.0.0.0/8
# Drift file location
driftfile /var/lib/chrony/drift
# Keep the hardware clock (RTC) in step with the system clock
rtcsync
# Allow stepping the clock initially, then only slew
# First number: step threshold in seconds (1.0 = 1 second)
# Second number: step limit (3 = allow stepping for first 3 updates)
makestep 1.0 3
# Log files
logdir /var/log/chrony
Critical notes:
- The peer directives ensure MON nodes sync with each other - this is MORE important than syncing to external sources
- makestep 1.0 3 allows initial time steps but switches to slewing afterwards
- Never remove the peering between monitors
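To confirm the peering is active on a monitor, check that the other MONs show up as peers. In chronyc sources output, peers are marked with '=' in the mode column and servers with '^':
# Peers (other MONs) appear with mode '='; external servers with '^'
chronyc sources
# Confirm this monitor is answering NTP queries from other cluster nodes
chronyc serverstats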
OSD/MDS/RGW Node Configuration
Configure non-monitor cluster nodes to sync from MONs:
# /etc/chrony/chrony.conf - Ceph OSD/MDS/RGW Node
# Use Ceph monitor nodes as time sources
server mon1.cluster.local iburst
server mon2.cluster.local iburst
server mon3.cluster.local iburst
# Drift file location
driftfile /var/lib/chrony/drift
# Keep the hardware clock (RTC) in step with the system clock
rtcsync
# Allow stepping the clock initially, then only slew
makestep 1.0 3
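After restarting chronyd on an OSD/MDS/RGW node, all three monitor hostnames should appear as sources, with one of them selected (marked '*') as the current synchronization source:
# Expect mon1/mon2/mon3 listed; '*' marks the selected source
chronyc sources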
Configuration with Local NTP Server
If you have a dedicated internal NTP server:
# /etc/chrony/chrony.conf - Ceph Monitor with local NTP
# Internal NTP server (primary source)
server ntp.company.local iburst prefer
# External sources (backup)
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
# Peer with other monitors
peer mon1.cluster.local
peer mon2.cluster.local
peer mon3.cluster.local
allow 10.0.0.0/8
driftfile /var/lib/chrony/drift
rtcsync
makestep 1.0 3
Deployment and Verification
Install and Enable Chrony
# Install (Debian/Ubuntu)
apt-get install chrony
# Install (RHEL/CentOS)
yum install chrony
# Enable and start (the unit is chronyd on RHEL/CentOS; on Debian/Ubuntu it may be named chrony)
systemctl enable chronyd
systemctl start chronyd
# Verify service is running
systemctl status chronyd
Disable Conflicting Time Services
# Stop and disable systemd-timesyncd (conflicts with Chrony)
systemctl stop systemd-timesyncd
systemctl disable systemd-timesyncd
# Stop and disable ntpd if present
systemctl stop ntpd
systemctl disable ntpd
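To verify that nothing else is still bound to the NTP port after the cleanup, a quick check with ss (part of iproute2):
# Only chronyd should be listening on UDP port 123
ss -lunp | grep ':123'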
Verify Chrony Synchronization
# Check time sources
chronyc sources -v
# Check tracking status
chronyc tracking
# Check server statistics (on MON nodes)
chronyc serverstats
Expected output from chronyc tracking:
Reference ID : C0A80001 (mon1.cluster.local)
Stratum : 3
Ref time (UTC) : Mon Feb 10 15:45:32 2026
System time : 0.000001234 seconds fast of NTP time
Last offset : +0.000000987 seconds
RMS offset : 0.000002145 seconds
Frequency : 3.456 ppm slow
Residual freq : +0.001 ppm
Skew : 0.012 ppm
Root delay : 0.001234567 seconds
Root dispersion : 0.000123456 seconds
Update interval : 64.5 seconds
Leap status : Normal
Key indicators of good sync:
- System time offset < 1ms (0.001 seconds)
- Root dispersion < 10ms
- Stratum value reasonable for your setup
- Update interval regular (typically 64 seconds)
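To compare offsets across the cluster in one pass, a minimal sketch assuming passwordless SSH to the monitor hostnames used in the examples above:
# Report the current chrony offset on each monitor
for host in mon1.cluster.local mon2.cluster.local mon3.cluster.local; do
    echo -n "$host: "
    ssh "$host" "chronyc tracking | grep 'System time'"
done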
Verify Ceph Clock Synchronization
# Check Ceph cluster health
ceph -s
# Check time sync status (Ceph Luminous and later)
ceph time-sync-status
# Check for clock skew warnings
ceph health detail
Healthy output:
cluster:
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
health: HEALTH_OK
services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 2d)
mgr: mon1(active, since 2d), standbys: mon2, mon3
osd: 12 osds: 12 up (since 2d), 12 in (since 2w)
Problem output:
cluster:
health: HEALTH_WARN
clock skew detected on mon.2, mon.3
mon.2 addr 10.0.1.2:6789/0 clock skew 0.085s > max 0.05s (latency 0.001s)
mon.3 addr 10.0.1.3:6789/0 clock skew 0.076s > max 0.05s (latency 0.001s)
Troubleshooting Clock Skew
Check Current Skew
# Ceph's perspective on time sync
ceph time-sync-status
# Chrony's perspective
chronyc sources -v
chronyc tracking
Force Immediate Sync
If clocks are significantly out of sync:
# Stop chronyd
systemctl stop chronyd
# Force sync (one-time step)
chronyd -q 'server pool.ntp.org iburst'
# Restart chronyd
systemctl start chronyd
Warning: Only do this when the skew has already degraded the cluster (for example, a monitor has dropped out of quorum). During normal operation, let Chrony gradually correct drift.
Restart Chrony on All Nodes
# Restart Chrony service
systemctl restart chronyd
# Wait 5-15 minutes for Ceph to re-evaluate sync
# Ceph checks time sync every 5 minutes
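Rather than polling by hand, you can watch the warning clear; a simple sketch using watch:
# Re-check once a minute; the warning should clear within one or two Ceph time checks
watch -n 60 'ceph time-sync-status; ceph health detail | grep -i skew'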
Common Issues and Solutions
Issue: Clock skew persists despite Chrony showing good sync
Solution:
- Verify that all MON nodes have identical Chrony configuration
- Check that MON nodes are peering with each other
- Ensure that there is no firewall blocking NTP (UDP port 123)
- Restart ceph-mon services on the affected nodes:
systemctl restart ceph-mon@<hostname>
# or
ceph orch daemon restart mon.<hostname>
Issue: Virtual machines show persistent clock skew
Solution:
- VM clocks tend to drift more than physical hardware
- Ensure that the VM host itself has accurate time
- Enable VM guest time synchronization if available
- Consider running MON nodes on physical hardware
- Use the hpet clocksource instead of tsc (see the persistence note after this example):
# Check current clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# Set to hpet
echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
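Note that writing to sysfs does not survive a reboot. To make the clocksource persistent, set it on the kernel command line; a sketch assuming a GRUB-based distribution:
# Add clocksource=hpet to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config:
grub2-mkconfig -o /boot/grub2/grub.cfg    # RHEL/CentOS
update-grub                               # Debian/Ubuntu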
Issue: High jitter or unstable sync
Solution:
- Use local/internal NTP server on same network as Ceph cluster
- Add more NTP sources for redundancy
- Check network connectivity and latency to NTP sources
- Reduce the polling interval by adding minpoll 4 maxpoll 7 to the server lines (see the example below)
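For example, polling every 16-128 seconds instead of chrony's defaults (64-1024 seconds), shown here against the internal server name used earlier:
# Poll between 2^4 (16s) and 2^7 (128s)
server ntp.company.local iburst minpoll 4 maxpoll 7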
Adjust Clock Skew Tolerance (Not Recommended)
If absolutely necessary, increase Ceph's tolerance:
# Increase from default 0.05s to 0.1s (100ms)
ceph config set mon mon_clock_drift_allowed 0.1
# Check current value
ceph config get mon mon_clock_drift_allowed
Warning: Increase this only as a last resort. The default value exists to prevent serious cluster problems. Fix the underlying time synchronization issue instead.
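If you do raise the tolerance temporarily, revert to the default once the underlying problem is fixed:
# Remove the override so the default (0.05s) applies again
ceph config rm mon mon_clock_drift_allowed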
Best Practices for Ceph + Chrony
- MON node peering is mandatory: Always configure peer between all MON nodes
- Never use time jumps: Avoid ntpdate, chronyd -q during normal operation, or makestep without limits
- Use multiple sources: Configure at least 3 external NTP sources
- Local NTP preferred: Deploy an internal NTP server for better stability
- Monitor continuously: Set up alerts for MON_CLOCK_SKEW warnings
- Physical hardware for MONs: Run monitors on bare metal when possible
- Consistent configuration: Use identical Chrony config across all MON nodes
- Wait after changes: Give Ceph 5-15 minutes to re-evaluate after fixing time sync
Monitoring Time Sync
Add monitoring for:
- Ceph health status (ceph -s)
- Chrony tracking offset (chronyc tracking | grep "System time")
- Clock skew warnings (ceph health detail | grep skew)
Example monitoring script:
#!/bin/bash
# Check Ceph clock skew
OFFSET=$(chronyc tracking | grep "System time" | awk '{print $4}')
HEALTH=$(ceph health detail | grep -i skew)
if [ ! -z "$HEALTH" ]; then
echo "WARNING: Ceph clock skew detected"
echo "$HEALTH"
exit 1
fi
# Alert if offset > 10ms
if (( $(echo "$OFFSET > 0.010" | bc -l) )); then
echo "WARNING: Time offset ${OFFSET}s exceeds 10ms threshold"
exit 1
fi
echo "OK: Time sync healthy (offset: ${OFFSET}s)"
Summary
For Ceph clusters:
- Use Chrony instead of ntpd
- Configure MON nodes to peer with each other (critical)
- Never allow sudden time jumps
- Keep clock skew under 50ms (ideally < 10ms)
- Monitor continuously and address warnings immediately
- Use local NTP server when possible
Proper time synchronization is not optional for Ceph - it's a fundamental requirement for cluster stability.
See Also
- ntp.org guidance on providing NTP services for huge networks
- Upstream Ceph documentation: mon_clock_skew