Feature Description & Motivation
Currently, the OSS Grafana observability setup in `4.validation_and_observability/4.prometheus-grafana/` only supports SageMaker HyperPod clusters. AWS ParallelCluster users cannot monitor compute nodes because:
- **No installation script for ParallelCluster:** The existing setup assumes HyperPod Lifecycle Scripts (Docker-based deployment), which are incompatible with ParallelCluster's `OnNodeConfigured` hooks
- **Configuration issues in `prometheus-agent-collector.yaml`:**
  - `scrape_timeout: 5m` violates the Prometheus specification (timeout must be ≤ `scrape_interval`)
  - An unnecessary `instance-type` filter limits monitoring to specific GPU instance types only
This feature request proposes adding ParallelCluster support with:
1. **New installation script:** `1.architectures/2.aws-parallelcluster/utils/install-node-exporter.sh`
   - Binary-based deployment (compatible with `OnNodeConfigured` hooks)
   - Supports both amd64 and arm64 architectures
   - SHA256 checksum verification with retry logic
   - Can be referenced directly via GitHub raw URL (zero setup cost)
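A minimal sketch of what such an installer could look like. The pinned version, checksum placeholder, and helper names below are illustrative assumptions, not the actual script:

```shell
#!/usr/bin/env bash
# Illustrative sketch of a binary-based node_exporter installer suitable for
# ParallelCluster OnNodeConfigured hooks (no Docker required). Version and
# checksum values here are placeholders, not taken from the real script.
set -euo pipefail

NODE_EXPORTER_VERSION="1.8.2"  # assumed version; the real script pins its own

# Map `uname -m` output to the suffix used by node_exporter release tarballs.
detect_arch() {
  case "$1" in
    x86_64)  echo "amd64" ;;
    aarch64) echo "arm64" ;;
    *)       echo "unsupported arch: $1" >&2; return 1 ;;
  esac
}

# Download a file with retries and verify its SHA256 checksum before use.
download_and_verify() {
  local url="$1" sha256="$2" out="$3" attempt
  for attempt in 1 2 3; do
    if curl -fsSL "$url" -o "$out" &&
       echo "${sha256}  ${out}" | sha256sum -c --status; then
      return 0
    fi
    echo "download/verify attempt ${attempt} failed; retrying..." >&2
    sleep 5
  done
  return 1
}

install_node_exporter() {
  local arch tarball url
  arch="$(detect_arch "$(uname -m)")"
  tarball="node_exporter-${NODE_EXPORTER_VERSION}.linux-${arch}.tar.gz"
  url="https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/${tarball}"
  download_and_verify "$url" "<expected-sha256>" "/tmp/${tarball}"
  tar -xzf "/tmp/${tarball}" -C /tmp
  install -m 0755 \
    "/tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-${arch}/node_exporter" \
    /usr/local/bin/node_exporter
  # A systemd unit (omitted here) would then start node_exporter on port 9100.
}

# install_node_exporter   # the real script would invoke this unconditionally
```

After a node bootstraps, its metrics should be reachable by the Prometheus agent at `http://<node-ip>:9100/metrics`.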
2. **Documentation updates:** `README-OS-grafana.md`
   - Add a platform support statement covering both HyperPod and ParallelCluster
   - Add platform-specific compute node setup instructions
3. **Configuration fixes:** `prometheus-agent-collector.yaml`
   - Fix `scrape_timeout: 5m` → `30s` (complies with Prometheus 3.x validation)
   - Remove the `instance-type` filter from the `efa_node_exporter` job (node_exporter is hardware-agnostic)
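An illustrative fragment of what the corrected scrape configuration could look like. Apart from the `efa_node_exporter` job name and the `scrape_timeout` fix, the field values (discovery mechanism, region, port) are placeholder assumptions, not the repository's actual file:

```yaml
scrape_configs:
  - job_name: efa_node_exporter
    scrape_interval: 1m
    scrape_timeout: 30s   # was 5m; must be <= scrape_interval
    ec2_sd_configs:
      - region: us-west-2   # placeholder region
        port: 9100
        # No instance-type filter here: node_exporter exposes OS-level
        # metrics on any instance type, CPU or GPU.
```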
Benefits:
- Enables monitoring for all ParallelCluster compute node types (CPU and GPU)
- Unified observability solution for both HyperPod and ParallelCluster
- Complies with the Prometheus specification
- Simple deployment via GitHub raw URL
ParallelCluster configuration example:

```yaml
Scheduling:
  SlurmQueues:
    - Name: compute
      CustomActions:
        OnNodeConfigured:
          Sequence:
            - Script: https://raw.githubusercontent.com/awslabs/awsome-distributed-training/main/1.architectures/2.aws-parallelcluster/utils/install-node-exporter.sh
      Networking:
        AdditionalSecurityGroups:
          - <PrometheusClusterSecurityGroup>
```
Alternatives Considered
Alternative 1: Docker-based deployment for ParallelCluster
- Rejected: ParallelCluster's `OnNodeConfigured` hooks execute during node bootstrap, before Docker is available. It would also require significant cluster configuration changes.
Alternative 2: Keep the `instance-type` filter
- Rejected: node_exporter is hardware-agnostic and collects OS metrics regardless of instance type. The filter limits monitoring to GPU instances only, excluding CPU nodes.
Alternative 3: Keep the original `scrape_timeout` value
- Rejected: Prometheus 3.x validates that `scrape_timeout` ≤ `scrape_interval`. The current `5m` timeout with a `1m` interval violates this specification and may cause errors in future Prometheus versions.
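The interval/timeout constraint is easy to check mechanically. A toy validator that makes the rule concrete; it handles only simple `Ns`/`Nm`/`Nh` durations, whereas Prometheus itself accepts richer forms (e.g. `1m30s`), so this is an illustration, not the actual Prometheus parser:

```shell
# Toy check of the Prometheus rule: scrape_timeout <= scrape_interval.

# Convert a simple "Ns"/"Nm"/"Nh" duration to seconds.
to_seconds() {
  local n="${1%[smh]}" unit="${1: -1}"
  case "$unit" in
    s) echo "$n" ;;
    m) echo $(( n * 60 )) ;;
    h) echo $(( n * 3600 )) ;;
    *) return 1 ;;
  esac
}

# Succeeds (exit 0) only when the timeout fits inside the interval.
scrape_config_valid() {  # usage: scrape_config_valid <interval> <timeout>
  [ "$(to_seconds "$2")" -le "$(to_seconds "$1")" ]
}

scrape_config_valid 1m 5m  || echo "1m interval with 5m timeout: rejected"
scrape_config_valid 1m 30s && echo "1m interval with 30s timeout: accepted"
```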
Additional Context
Workshop integration:
Prometheus specification reference: