[Feature]: Migrate from deprecated SckyzO/slurm_exporter to a maintained Prometheus exporter for Slurm #974

@KeitaW

Feature Description & Motivation

The observability stack currently uses SckyzO/slurm_exporter v1.1.0 as the Prometheus exporter for Slurm metrics. The author has announced that this exporter will no longer be actively maintained starting with Slurm 25.11, because Slurm will natively integrate OpenMetrics at that point.

Why this matters now:

  • HyperPod currently ships Slurm 24.11, so the existing exporter still works today, but it is already in maintenance-only mode with no new features or bug-fix guarantees.
  • Slurm 25.11 will include native OpenMetrics support, making external exporters optional, but HyperPod is not yet on 25.11.
  • The current install script (observability/install_slurm_exporter.sh) builds the exporter from source, which requires installing the Go toolchain on every head node. This adds complexity, build time, and a transient dependency to the Lifecycle Configuration Script (LCS).

Related issues: #492 (stale), #644 (stale).

Category

Enhancement to existing test case

Alternatives Considered

| Exporter | Status | Slurm requirement | Notes |
|---|---|---|---|
| rivosinc/prometheus-slurm-exporter | Active (v1.8.0, Sep 2025, 65+ stars) | Current Slurm | Go binary with CLI fallback mode; no compiled Slurm plugins needed |
| sckyzo/slurm_prometheus_exporter | New (same author as old exporter) | Slurm 25.11+ | Uses native OpenMetrics endpoint, YAML config, MIT license |
| Slurm 25.11 native OpenMetrics | Not yet available on HyperPod | Slurm 25.11+ | Built-in to slurmctld; no external exporter required |

Each option has different trade-offs around Slurm version compatibility, packaging complexity, and metric coverage. The right choice likely depends on the HyperPod Slurm upgrade timeline.
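The version dependence can be sketched in a few lines. This is only an illustration: it assumes `sinfo --version` prints a string such as `slurm 24.11.5`, the function and constant names are hypothetical, and the mapping simply restates the "Slurm 25.11+" requirement from the table rather than recommending an option.

```python
import re

# Hypothetical labels for the option families listed in the table.
EXTERNAL_EXPORTER = "external exporter (e.g. rivosinc/prometheus-slurm-exporter)"
NATIVE_OPENMETRICS = "native OpenMetrics (built-in or sckyzo/slurm_prometheus_exporter)"


def parse_slurm_version(text):
    """Extract (major, minor) from output such as 'slurm 24.11.5'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Slurm version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def candidate_option(version_text):
    """Return which family of options is usable for this Slurm version."""
    # Tuple comparison: (25, 11) and later can use the native endpoint.
    if parse_slurm_version(version_text) >= (25, 11):
        return NATIVE_OPENMETRICS
    return EXTERNAL_EXPORTER
```

With the Slurm 24.11 that HyperPod ships today, `candidate_option("slurm 24.11.5")` falls into the external-exporter family; only a 25.11 or later version string selects the native path.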

Additional Context

Affected files

Eight files across the LCS observability scripts and the standalone Prometheus/Grafana setup:

| File | Role |
|---|---|
| observability/install_slurm_exporter.sh | Main install script (builds from source, requires Go) |
| observability/install_observability.py | Orchestrator that calls the install script |
| observability/stop_observability.py | Stops the systemd service |
| observability/LICENSE_SLURM_EXPORTER.txt | License for current exporter |
| observability/otel_config/config-head-template.yaml | Prometheus scrape config (target port) |
| 4.validation_and_observability/4.prometheus-grafana/README.md | Links to archived vpenso repo and Grafana dashboard 4323 |
| 4.validation_and_observability/4.prometheus-grafana/update-prometheus.sh | Scrape config for standalone setup |
| 4.validation_and_observability/4.prometheus-grafana/1click-dashboards-deployment/dashboards/create_ml_dashboards.py | Imports Grafana dashboard 4323 |

Not affected

The Slinky / EKS path (3.test_cases/19.slinky-on-eks/) already uses SlinkyProject/slurm-exporter and is not impacted.

Grafana dashboard

The current setup imports Grafana dashboard 4323, which was designed for the vpenso/SckyzO exporter metric names. A different exporter will likely expose different metric names, so the dashboard may need to be replaced or adapted.
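One way to gauge how much of the dashboard would need adapting is to diff the metric names exposed by the old and the candidate exporter. The sketch below assumes both serve the standard Prometheus text exposition format on their `/metrics` endpoints; the helper names are hypothetical, and no actual metric names from either exporter are assumed.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names found in a /metrics payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _METRIC_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def diff_metrics(old_text, new_text):
    """Split names into old-only, new-only, and shared between both payloads."""
    old, new = metric_names(old_text), metric_names(new_text)
    return {
        "only_old": sorted(old - new),
        "only_new": sorted(new - old),
        "shared": sorted(old & new),
    }
```

Feeding it the saved `/metrics` output of each exporter gives an `only_old` list; any dashboard panel that queries a name in that list would need to be rewritten or dropped.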

Reviewer requirement

Any PR for this change must be assigned to hyperpod-lcs-dev for SageMaker service team review, since these scripts run as HyperPod Lifecycle Configuration.
