Monitoring Ceph health with Prometheus is straightforward since Ceph already exposes an endpoint with all of its metrics for Prometheus. In this article, we will put it all together to help you start monitoring your Ceph storage cluster and guide you through all the important metrics.
Ceph offers a great solution for object-based storage to manage large amounts of data even on economical hardware. Besides, the Ceph Foundation is organized as a directed fund under the Linux Foundation.
Monitoring Ceph is crucial for maintaining the health of your storage backend, as well as keeping the cluster’s quorum.
How to enable Prometheus monitoring for Ceph
If you deployed Ceph with Rook, you won’t have to do anything else. The Prometheus manager module is already enabled and the pod is annotated, so Prometheus will gather the metrics automatically.
If you didn’t deploy Ceph with Rook, there are a couple of additional steps.
Enable Prometheus monitoring
Use this command to enable Prometheus in your Ceph storage cluster. It enables an endpoint returning Prometheus metrics.
ceph mgr module enable prometheus
Please note that after doing this, you’ll need to restart the Prometheus manager module to completely enable Prometheus.
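Once the module is enabled, a quick way to confirm that the endpoint is serving metrics is to fetch it and look for a known metric name. The following is a minimal sketch, assuming the exporter listens on port 9283 (the port used in the annotations in the next section) at the path /metrics. The host name ceph-mgr.example.local and the helper names are hypothetical.

```python
import urllib.request


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text exposition from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def metric_names(exposition: str) -> set:
    """Return the set of metric names found in a text exposition payload."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names
```

For example, "ceph_health_status" in metric_names(fetch_metrics("http://ceph-mgr.example.local:9283/metrics")) should evaluate to True on a working setup.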
Annotate Ceph pods with Prometheus metrics
Add these annotations to the ceph-mgr deployment so Prometheus service discovery can automatically detect your Ceph metrics endpoint.
annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '9283'
Monitoring Ceph health
Ceph status
The absolute top metric you should check is ceph_health_status. It reports 0 when the cluster is healthy (HEALTH_OK), 1 for HEALTH_WARN, and 2 for HEALTH_ERR. If this metric doesn’t exist or it returns something other than 0, the cluster needs attention.
Let’s create an alert to be aware of this situation:
absent(ceph_health_status == 0)
Cluster remaining storage
As in all systems where you use disks, you need to check the remaining available storage. To check this, you can use ceph_cluster_total_bytes to get the total disk capacity (in bytes) and ceph_cluster_total_used_bytes to get the disk usage (in bytes).
Let’s create a PromQL query to alert when the space left is under 15% of the total disk space:
(ceph_cluster_total_bytes-ceph_cluster_total_used_bytes)/ceph_cluster_total_bytes < 0.15
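To make the threshold concrete, the same arithmetic can be worked through outside Prometheus. This is a minimal sketch with made-up capacity figures; the function name and the sample numbers are illustrative only.

```python
def free_space_ratio(total_bytes: float, used_bytes: float) -> float:
    """Fraction of the cluster capacity that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - used_bytes) / total_bytes


TIB = 1024 ** 4
THRESHOLD = 0.15  # same 15% threshold as the PromQL expression

# Hypothetical cluster: 100 TiB of capacity, 90 TiB used.
ratio = free_space_ratio(100 * TIB, 90 * TIB)
should_alert = ratio < THRESHOLD
print(f"free ratio: {ratio:.2f}, alert: {should_alert}")
```

With these sample figures, 10% of the capacity is free, which is below the 15% threshold, so the alert condition holds.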
Object Storage Daemon nodes down
An Object Storage Daemon (OSD) is responsible for storing objects on a local file system and providing access to them over the network. Each storage node runs at least one OSD, typically one per disk. If an OSD goes down, the cluster loses access to the disk that the OSD manages.
Let’s create an alert that fires when an OSD is down:
ceph_osd_up == 0
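Because ceph_osd_up is reported per OSD, the expression returns one series for each daemon that is down. As an illustration of that filtering, here is a minimal sketch that picks the down OSDs out of a sample of the text exposition. The ceph_daemon label name and the sample values are assumptions for illustration, so check the labels your exporter actually emits.

```python
import re

# Matches lines such as: ceph_osd_up{ceph_daemon="osd.3"} 0.0
_OSD_UP_LINE = re.compile(
    r'^ceph_osd_up\{[^}]*ceph_daemon="(?P<daemon>[^"]+)"[^}]*\}\s+(?P<value>\S+)$'
)


def down_osds(exposition: str) -> list:
    """Return the daemon names whose ceph_osd_up sample equals 0."""
    down = []
    for line in exposition.splitlines():
        match = _OSD_UP_LINE.match(line.strip())
        if match and float(match.group("value")) == 0.0:
            down.append(match.group("daemon"))
    return down


# Hypothetical sample of the exporter output for a three-OSD cluster.
SAMPLE = """\
# TYPE ceph_osd_up untyped
ceph_osd_up{ceph_daemon="osd.0"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 0.0
ceph_osd_up{ceph_daemon="osd.2"} 1.0
"""

print(down_osds(SAMPLE))
```

In this sample only osd.1 reports 0, so it is the only daemon returned.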
Missing MDS replicas
It’s important to check that the actual number of MDS replicas isn’t lower than expected. Usually, for high availability (HA), the number is three. But in larger clusters, it can be higher.
ceph-mds is the metadata server daemon for the Ceph distributed file system (CephFS). It manages the file system namespace and coordinates access to the shared OSD cluster. If no MDS is available, clients won’t be able to access the file system.
This PromQL query will alert you if there’s no MDS available. It uses absent() because count() returns an empty result, not 0, when no series match.
absent(ceph_mds_metadata == 1)
Quorum
If the Ceph MONs cannot form a quorum, cephadm is unable to manage the cluster until the quorum is restored. Learn more about how Ceph uses Paxos to establish consensus about the master cluster map in the Ceph documentation.
It’s recommended to run at least three monitors. A quorum requires a majority of them, so if any one is down, the quorum is at risk.
You can alert on this with the ceph_mon_quorum_status metric:
count(ceph_mon_quorum_status == 1) <= ((count(ceph_mon_metadata) % 2) + 1)
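The reason a single failure matters with three monitors is the majority rule: a quorum needs more than half of the monitors. The sketch below works through that arithmetic. It illustrates the majority rule only and is not a translation of the alert expression above; the function names are hypothetical.

```python
def quorum_size(total_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if total_monitors < 1:
        raise ValueError("need at least one monitor")
    return total_monitors // 2 + 1


def failures_tolerated(total_monitors: int) -> int:
    """How many monitors can fail while a majority can still be formed."""
    return total_monitors - quorum_size(total_monitors)


for n in (1, 3, 5):
    print(f"{n} monitors: quorum={quorum_size(n)}, "
          f"tolerated failures={failures_tolerated(n)}")
```

With three monitors the cluster tolerates exactly one failure, so once one monitor is down, losing another breaks the quorum.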
Add these metrics to Grafana or Sysdig Monitor in a few clicks
In this article, we’ve learned how monitoring Ceph health with Prometheus can easily help you check your Ceph cluster health, and identified the top five key metrics you need to look at.
In PromCat.io, you can find a dashboard and the alerts showcased in this article, ready to use in Grafana or Sysdig Monitor. These integrations are curated, tested, and maintained by Sysdig.
Also, learn how easy it is to monitor Ceph with Sysdig Monitor.
If you would like to try this integration, we invite you to sign up for a free trial of Sysdig Monitor.