Feature Description & Motivation
The observability stack currently uses SckyzO/slurm_exporter v1.1.0 as the Prometheus exporter for Slurm metrics. The author has announced that this exporter will no longer be actively maintained starting with Slurm 25.11, because Slurm will natively integrate OpenMetrics at that point.
Why this matters now:
- HyperPod currently ships Slurm 24.11, so the existing exporter still works today, but it is already in maintenance-only mode with no new features or bug-fix guarantees.
- Slurm 25.11 will include native OpenMetrics support, making external exporters optional — but HyperPod is not yet on 25.11.
- The current install script (observability/install_slurm_exporter.sh) builds the exporter from source, which requires installing the Go toolchain on every head node. This adds complexity, build time, and a transient dependency to the Lifecycle Configuration Script (LCS).
Related issues: #492 (stale), #644 (stale).
Category
Enhancement to existing test case
Alternatives Considered
| Exporter | Status | Slurm requirement | Notes |
|---|---|---|---|
| rivosinc/prometheus-slurm-exporter | Active (v1.8.0, Sep 2025, 65+ stars) | Current Slurm | Go binary with CLI fallback mode; no compiled Slurm plugins needed |
| sckyzo/slurm_prometheus_exporter | New (same author as old exporter) | Slurm 25.11+ | Uses native OpenMetrics endpoint, YAML config, MIT license |
| Slurm 25.11 native OpenMetrics | Not yet available on HyperPod | Slurm 25.11+ | Built-in to slurmctld; no external exporter required |
Each option has different trade-offs around Slurm version compatibility, packaging complexity, and metric coverage. The right choice likely depends on the HyperPod Slurm upgrade timeline.
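Since two of the three options require Slurm 25.11+, the install path could branch on the detected Slurm version. The following is a minimal sketch of that gate, not a proposed implementation: it assumes the version string looks like `slurm 24.11.0` (the form printed by `sinfo --version`), and the names `parse_slurm_version` and `exporter_strategy` are hypothetical.

```python
import re

# Per the table above, native OpenMetrics support starts with Slurm 25.11.
NATIVE_OPENMETRICS_MIN = (25, 11)


def parse_slurm_version(text):
    """Extract (major, minor) from a string such as 'slurm 24.11.0'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Slurm version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def exporter_strategy(version):
    """Return 'native' if slurmctld can serve OpenMetrics itself, else 'external'."""
    # Tuple comparison orders by major first, then minor.
    return "native" if version >= NATIVE_OPENMETRICS_MIN else "external"


# Hypothetical usage with the Slurm version HyperPod ships today.
print(exporter_strategy(parse_slurm_version("slurm 24.11.0")))
```

A gate like this would let the same LCS script keep installing an external exporter on 24.11 and skip it once HyperPod moves to 25.11.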
Additional Context
Affected files
Eight files across the LCS observability scripts and the standalone Prometheus/Grafana setup:
| File | Role |
|---|---|
| observability/install_slurm_exporter.sh | Main install script (builds from source, requires Go) |
| observability/install_observability.py | Orchestrator that calls the install script |
| observability/stop_observability.py | Stops the systemd service |
| observability/LICENSE_SLURM_EXPORTER.txt | License for current exporter |
| observability/otel_config/config-head-template.yaml | Prometheus scrape config (target port) |
| 4.validation_and_observability/4.prometheus-grafana/README.md | Links to archived vpenso repo and Grafana dashboard 4323 |
| 4.validation_and_observability/4.prometheus-grafana/update-prometheus.sh | Scrape config for standalone setup |
| 4.validation_and_observability/4.prometheus-grafana/1click-dashboards-deployment/dashboards/create_ml_dashboards.py | Imports Grafana dashboard 4323 |
Not affected
The Slinky / EKS path (3.test_cases/19.slinky-on-eks/) already uses SlinkyProject/slurm-exporter and is not impacted.
Grafana dashboard
The current setup imports Grafana dashboard 4323, which was designed for the vpenso/SckyzO exporter metric names. A different exporter will likely expose different metric names, so the dashboard may need to be replaced or adapted.
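One way to size the dashboard work is to compare the metric names the dashboard queries against what a candidate exporter actually exposes. The sketch below parses Prometheus text-format output and reports the missing names. It is illustrative only: the helper names are hypothetical, and the metric names in the sample are placeholders, not a verified list of what dashboard 4323 or any exporter uses.

```python
import re

# Matches the metric name at the start of a sample line in the Prometheus text format.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(exposition_text):
    """Collect metric names from text-format output, skipping comments and blanks."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no samples
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(dashboard_metrics, exposition_text):
    """Return the dashboard metric names that the exporter output does not contain."""
    return sorted(set(dashboard_metrics) - exposed_metric_names(exposition_text))


# Placeholder data: in practice the text would come from the exporter's /metrics endpoint.
sample = """\
# HELP slurm_nodes_idle Idle nodes
# TYPE slurm_nodes_idle gauge
slurm_nodes_idle 3
slurm_partition_jobs{partition="dev"} 7
"""
needed = ["slurm_nodes_idle", "slurm_queue_pending"]
print(missing_metrics(needed, sample))
```

An empty result would suggest the dashboard can be kept as-is; a long list would argue for replacing it rather than adapting it.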
Reviewer requirement
Any PR for this change must be assigned to hyperpod-lcs-dev for SageMaker service team review, since these scripts run as HyperPod Lifecycle Configuration.