Skip to content

[Feature]: Add AWS ParallelCluster compute node monitoring support with node_exporter #1042

@littlemex

Description

@littlemex

Feature Description & Motivation

Currently, the OSS Grafana observability setup in 4.validation_and_observability/4.prometheus-grafana/ only supports SageMaker HyperPod clusters. AWS ParallelCluster users cannot monitor compute nodes because:

  1. No installation script for ParallelCluster: The existing setup assumes HyperPod Lifecycle Scripts (Docker-based deployment), which are incompatible with ParallelCluster's OnNodeConfigured hooks
  2. Configuration issues in prometheus-agent-collector.yaml:
    • scrape_timeout: 5m violates Prometheus specification (timeout must be ≤ scrape_interval)
    • Unnecessary instance-type filter limits monitoring to specific GPU types only

This feature request proposes adding ParallelCluster support with:

1. New installation script: 1.architectures/2.aws-parallelcluster/utils/install-node-exporter.sh

  • Binary-based deployment (compatible with OnNodeConfigured hooks)
  • Supports both amd64 and arm64 architectures
  • SHA256 checksum verification with retry logic
  • Can be referenced directly via GitHub raw URL (zero setup cost)

2. Documentation updates: README-OS-grafana.md

  • Add platform support statement for both HyperPod and ParallelCluster
  • Add platform-specific compute node setup instructions

3. Configuration fixes: prometheus-agent-collector.yaml

  • Fix scrape_timeout: 5m30s (complies with Prometheus 3.x validation)
  • Remove instance-type filter from efa_node_exporter job (node_exporter is hardware-agnostic)

Benefits:

  • Enables monitoring for all ParallelCluster compute node types (CPU, GPU)
  • Unified observability solution for both HyperPod and ParallelCluster
  • Complies with Prometheus specification
  • Simple deployment via GitHub raw URL

ParallelCluster configuration example:

Scheduling:
  SlurmQueues:
    - Name: compute
      CustomActions:
        OnNodeConfigured:
          Sequence:
            - Script: https://raw.githubusercontent.com/awslabs/awsome-distributed-training/main/1.architectures/2.aws-parallelcluster/utils/install-node-exporter.sh
      Networking:
        AdditionalSecurityGroups:
          - <PrometheusClusterSecurityGroup>

Alternatives Considered

Alternative 1: Docker-based deployment for ParallelCluster

  • Rejected: ParallelCluster's OnNodeConfigured hooks execute during node bootstrap before Docker is available. Would require significant cluster configuration changes.

Alternative 2: Keep instance-type filter

  • Rejected: node_exporter is hardware-agnostic and collects OS metrics regardless of instance type. Filtering limits monitoring to GPU instances only, excluding CPU nodes.

Alternative 3: Keep original scrape_timeout value

  • Rejected: Prometheus 3.x validates that scrape_timeoutscrape_interval. The current 5m timeout with 1m interval violates this specification and may cause errors in future Prometheus versions.

Additional Context

Workshop integration:

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

Status

Todo

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions