SckyzO/slurm_exporter

Prometheus Slurm Exporter 🚀

📸 View Dashboard Screenshots

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.

Note

Looking for a next-generation Slurm exporter with native OpenMetrics support (Slurm 25.11+)? Check out my new project: sckyzo/slurm_prometheus_exporter

✨ Features: Native OpenMetrics · Multiple endpoints · Basic Auth & TLS · Global labels · YAML config · Clean Architecture

✨ Features

  • ✅ Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
  • ✅ All metric collectors are optional and can be enabled/disabled via flags.
  • ✅ Supports TLS and Basic Authentication for secure connections.
  • ✅ OpenMetrics format supported (exemplars, newer Prometheus features).
  • ✅ Per-collector health metrics (slurm_exporter_collector_success, slurm_exporter_collector_duration_seconds).
  • ✅ Liveness probe at /healthz for orchestrators (Kubernetes, systemd).
  • ✅ GPU metrics per account and user (slurm_account_gpus_running, slurm_user_gpus_running).
  • ✅ Per-reservation node state metrics (slurm_reservation_nodes_*).
  • ✅ Ready-to-use Grafana dashboard.
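
As a rough illustration of how the per-collector health metrics might be used, the sketch below filters exporter output for collectors whose slurm_exporter_collector_success sample is 0. The function name failed_collectors is hypothetical. The sketch assumes each sample line has the form `<name>{labels} <value>` with no trailing timestamp. The exact label names the exporter attaches are not stated here, so the function prints the whole matching line instead of parsing labels.

```shell
# Print every slurm_exporter_collector_success sample whose value is 0.
# Reads Prometheus text-format metrics on stdin.
# Assumption: the value is the last whitespace-separated field on the line.
failed_collectors() {
  awk '
    /^#/ { next }   # skip HELP/TYPE comment lines
    /^slurm_exporter_collector_success/ && $NF + 0 == 0 { print $0 }
  '
}
```

One way to use it is to pipe the exporter's metrics page into the function, for example `curl -s "$EXPORTER_URL/metrics" | failed_collectors`, where EXPORTER_URL is a placeholder for the address your exporter listens on.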

📦 Installation

There are two recommended ways to install the Slurm Exporter.

1. From Pre-compiled Releases

This is the easiest method for most users.

  1. Download the latest release for your OS and architecture from the GitHub Releases page. 📥

  2. Place the slurm_exporter binary in a suitable location on a node with Slurm CLI access, such as /usr/local/bin/.

  3. Ensure the binary is executable:

    chmod +x /usr/local/bin/slurm_exporter
  4. (Optional) To run the exporter as a service, you can adapt the example Systemd unit file provided in this repository at systemd/slurm_exporter.service.

    • Copy it to /etc/systemd/system/slurm_exporter.service and customize it for your environment (especially the ExecStart path).

    • Reload the Systemd daemon, then enable and start the service:

      sudo systemctl daemon-reload
      sudo systemctl enable slurm_exporter
      sudo systemctl start slurm_exporter
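
After starting the service, it can be useful to confirm that the exporter answers on its /healthz liveness endpoint. The following is a minimal sketch, not part of the project: wait_for_healthz is a hypothetical helper, and the base URL passed to it is an assumption that must match the listen address configured in your unit file.

```shell
# Poll BASE_URL/healthz until it responds successfully or attempts run out.
# Usage: wait_for_healthz BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# BASE_URL is an assumption; use the address your ExecStart line configures.
wait_for_healthz() {
  base_url=$1
  attempts=${2:-10}
  delay=${3:-1}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses (4xx/5xx).
    if curl -fsS -o /dev/null "${base_url}/healthz"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

If the function returns non-zero, `systemctl status slurm_exporter` and `journalctl -u slurm_exporter` are the usual places to look for the reason.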

2. From Source

If you want to build the exporter yourself, you can do so using the provided Makefile. 👩‍💻

  1. Clone the repository:

    git clone https://github.com/sckyzo/slurm_exporter.git
    cd slurm_exporter
  2. Build the binary:

    make build
  3. The new binary will be available at bin/slurm_exporter. You can then copy it to a location like /usr/local/bin/ and set up the Systemd service as described in the section above.


📈 Grafana Dashboards

Ten ready-to-use Grafana dashboards are provided in the dashboards_grafana/ directory. All dashboards use a $datasource template variable and are compatible with Grafana 12+.

| Dashboard | UID | Description |
|---|---|---|
| Cluster Overview | slurm-overview | Global cluster health: CPU/GPU utilization, node states, job totals, partition summary |
| Jobs & Queue | slurm-jobs | Job queue details by user, account, partition: pending reasons, top users |
| Node Detail | slurm-nodes | Per-node CPU & memory table (filtered by partition), scalable to 100k+ nodes |
| Scheduler | slurm-scheduler | slurmctld internals: cycle time, backfill, RPC statistics |
| Reservations & Licenses | slurm-reservations | Active reservations, node states per reservation, license usage |
| Exporter Health | slurm-health | Collector OK/FAIL status, scrape duration history, Slurm binary versions |
| Cluster Usage Statistics | slurm-usage | CPU/GPU utilization gauges, fairshare per account, top users by CPU |
| All Metrics Reference | slurm-all-metrics | Exhaustive reference panel for every exported metric |
| Accounting | slurm-accounting | User/account consumption, FairShare analysis, top consumers, priority diagnostics |
| Exporter Performance | slurm-exporter-perf | Command durations, cache freshness, error rates, scrape health (new in v1.8.0) |

Import to Grafana

Option 1: Copy JSON files to your Grafana provisioning directory:

cp dashboards_grafana/*.json /etc/grafana/provisioning/dashboards/
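
Grafana's file provisioning normally loads JSON dashboards through a provider definition (a YAML file in the provisioning directory) that points at the folder holding the JSON. If your installation does not already have a provider covering that folder, a minimal one might be written as sketched below. The helper name write_dashboard_provider, the file name slurm_exporter.yaml, and the provider name are all hypothetical choices, not something this repository ships.

```shell
# Write a minimal Grafana dashboard provider file.
# $1: directory where Grafana reads provider YAML files
# $2: directory containing the dashboard JSON files
# The file name and provider name below are illustrative assumptions.
write_dashboard_provider() {
  provider_dir=$1
  json_dir=$2
  cat > "${provider_dir}/slurm_exporter.yaml" <<EOF
apiVersion: 1
providers:
  - name: slurm_exporter
    type: file
    options:
      path: ${json_dir}
EOF
}
```

For the copy command above, this would be called as `write_dashboard_provider /etc/grafana/provisioning/dashboards /etc/grafana/provisioning/dashboards`, followed by a Grafana restart so the provider is picked up.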

Option 2: Import via API:

# Replace admin:password and grafana-host with your own credentials and host.
for f in dashboards_grafana/*.json; do
  curl -s -X POST http://admin:password@grafana-host:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
done

Scale note (Node Detail dashboard): The per-node table is filtered by the $partition variable. On clusters with 100k+ nodes, always select a specific partition to avoid loading excessive data. The partition summary and problem nodes panels are always scalable regardless of cluster size.


📸 Screenshots

Screenshots taken on a 20-node test cluster (alice/bob/carol/dave/eve/frank, multiple accounts and partitions). Click any thumbnail to open the full-size image. See dashboards_grafana/README.md for the full dashboard documentation.

  • Cluster Overview
  • Jobs & Queue
  • Node Detail (scalable 100k+ nodes)
  • Cluster Usage Statistics
  • Scheduler
  • Exporter Health
  • Reservations & Licenses
  • Accounting (new in v1.7.0)
  • Exporter Performance (new in v1.8.0)

All 10 dashboards are documented in dashboards_grafana/README.md.

📜 License

This project is licensed under the GNU General Public License, version 3 or later.


๐Ÿด About this fork

This project is a fork of cea-hpc/slurm_exporter, which itself is a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).

Feel free to contribute or open issues!

About

Slurm Exporter is a Prometheus exporter designed to scrape and expose a comprehensive range of performance and scheduling metrics from Slurm-managed clusters. It supports both CPU and GPU resource accounting, node and partition state monitoring, job tracking, and scheduler statistics.
