Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.
Note
Looking for a next-generation Slurm exporter with native OpenMetrics support (Slurm 25.11+)? Check out my new project: sckyzo/slurm_prometheus_exporter
✨ Features: Native OpenMetrics · Multiple endpoints · Basic Auth & TLS · Global labels · YAML config · Clean Architecture
- ✨ Features
- 📦 Installation
- ⚙️ Configuration (flags, collectors, Prometheus)
- 📊 Metrics Reference (all 14 collectors)
- 🛠️ Development (build, test, lint)
- 📈 Grafana Dashboards
- 📸 Screenshots
- 📜 License
- ✅ Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
- ✅ All metric collectors are optional and can be enabled or disabled via flags.
- ✅ Supports TLS and Basic Authentication for secure connections.
- ✅ OpenMetrics format supported (exemplars, newer Prometheus features).
- ✅ Per-collector health metrics (`slurm_exporter_collector_success`, `slurm_exporter_collector_duration_seconds`).
- ✅ Liveness probe at `/healthz` for orchestrators (Kubernetes, systemd).
- ✅ GPU metrics per account and user (`slurm_account_gpus_running`, `slurm_user_gpus_running`).
- ✅ Per-reservation node state metrics (`slurm_reservation_nodes_*`).
- ✅ Ready-to-use Grafana dashboard.
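Once the exporter is running, Prometheus scrapes it like any other target. A minimal scrape job might look like the sketch below; the host name, port `8080`, and intervals are placeholders to adapt to your deployment (Slurm commands can be slow on large clusters, so a generous timeout is usually sensible):

```yaml
# Hypothetical Prometheus scrape job for the Slurm exporter.
# Replace the target with the node and port where the exporter listens.
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm-login-node:8080']
```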
There are two recommended ways to install the Slurm exporter: downloading a pre-built binary, or building from source. Downloading a pre-built release is the easiest method for most users.
1. Download the latest release for your OS and architecture from the GitHub Releases page. 📥

2. Place the `slurm_exporter` binary in a suitable location on a node with Slurm CLI access, such as `/usr/local/bin/`.

3. Ensure the binary is executable:

   ```shell
   chmod +x /usr/local/bin/slurm_exporter
   ```

4. (Optional) To run the exporter as a service, you can adapt the example systemd unit file provided in this repository at `systemd/slurm_exporter.service`.
   - Copy it to `/etc/systemd/system/slurm_exporter.service` and customize it for your environment (especially the `ExecStart` path).
   - Reload the systemd daemon, then enable and start the service:

     ```shell
     sudo systemctl daemon-reload
     sudo systemctl enable slurm_exporter
     sudo systemctl start slurm_exporter
     ```
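If you prefer to write the unit from scratch rather than adapting the bundled one, a minimal sketch could look like this; the `User` and `ExecStart` values are assumptions to adjust for your environment, not the repository's actual unit file:

```ini
[Unit]
Description=Prometheus Slurm Exporter
After=network.target

[Service]
Type=simple
# Run as an unprivileged account that can execute the Slurm CLI tools.
User=slurm
ExecStart=/usr/local/bin/slurm_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
```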
If you want to build the exporter yourself, you can do so using the provided Makefile. 👩‍💻

1. Clone the repository:

   ```shell
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the binary:

   ```shell
   make build
   ```

3. The new binary will be available at `bin/slurm_exporter`. You can then copy it to a location like `/usr/local/bin/` and set up the systemd service as described in the section above.
Ten ready-to-use Grafana dashboards are provided in the `dashboards_grafana/` directory.
All dashboards use a `$datasource` template variable and are compatible with Grafana 12+.
| Dashboard | UID | Description |
|---|---|---|
| Cluster Overview | `slurm-overview` | Global cluster health: CPU/GPU utilization, node states, job totals, partition summary |
| Jobs & Queue | `slurm-jobs` | Job queue details by user, account, and partition; pending reasons, top users |
| Node Detail | `slurm-nodes` | Per-node CPU & memory table (filtered by partition), scalable to 100k+ nodes |
| Scheduler | `slurm-scheduler` | slurmctld internals: cycle time, backfill, RPC statistics |
| Reservations & Licenses | `slurm-reservations` | Active reservations, node states per reservation, license usage |
| Exporter Health | `slurm-health` | Collector OK/FAIL status, scrape duration history, Slurm binary versions |
| Cluster Usage Statistics | `slurm-usage` | CPU/GPU utilization gauges, fairshare per account, top users by CPU |
| All Metrics Reference | `slurm-all-metrics` | Exhaustive reference panel for every exported metric |
| Accounting | `slurm-accounting` | User/account consumption, FairShare analysis, top consumers, priority diagnostics |
| Exporter Performance | `slurm-exporter-perf` | Command durations, cache freshness, error rates, scrape health (new in v1.8.0) |
Option 1: Copy JSON files to your Grafana provisioning directory:

```shell
cp dashboards_grafana/*.json /etc/grafana/provisioning/dashboards/
```

Option 2: Import via the API:

```shell
for f in dashboards_grafana/*.json; do
  curl -s -X POST http://admin:password@grafana-host:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
done
```

Scale note (Node Detail dashboard): The per-node table is filtered by the `$partition` variable. On clusters with 100k+ nodes, always select a specific partition to avoid loading excessive data. The partition summary and problem-nodes panels scale regardless of cluster size.
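The API import loop wraps each dashboard file in a small JSON envelope. Before POSTing against a live Grafana, you can sanity-check that envelope locally; the file below is a stand-in for a real dashboard export:

```shell
#!/bin/sh
# Build the same {"dashboard": ..., "overwrite": true, "folderId": 0}
# envelope the import loop sends, then validate it with python3's JSON parser.
f=/tmp/demo-dashboard.json
printf '{"uid": "slurm-overview", "title": "Cluster Overview"}' > "$f"
payload="{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
if echo "$payload" | python3 -m json.tool > /dev/null; then
    echo "payload OK"
fi
```

This catches malformed dashboard JSON (for example, a truncated export) before it reaches the Grafana API.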
Screenshots were taken on a 20-node test cluster (alice/bob/carol/dave/eve/frank, multiple accounts and partitions). Click any thumbnail to open the full-size image. See `dashboards_grafana/README.md` for the full dashboard documentation.
This project is licensed under the GNU General Public License, version 3 or later.
This project is a fork of cea-hpc/slurm_exporter, which itself is a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).
Feel free to contribute or open issues!