[Feature]: Add AWS ParallelCluster compute node monitoring support with node_exporter #1042

### Feature Description & Motivation

Currently, the OSS Grafana observability setup in `4.validation_and_observability/4.prometheus-grafana/` only supports SageMaker HyperPod clusters. AWS ParallelCluster users cannot monitor compute nodes because:

1. **No installation script for ParallelCluster**: The existing setup assumes HyperPod Lifecycle Scripts (Docker-based deployment), an approach that is incompatible with ParallelCluster's `OnNodeConfigured` hooks.
2. **Configuration issues in `prometheus-agent-collector.yaml`**:
   - `scrape_timeout: 5m` violates the Prometheus specification (the timeout must be ≤ `scrape_interval`, which is `1m` here)
   - An unnecessary `instance-type` filter restricts monitoring to specific GPU instance types

This feature request proposes adding ParallelCluster support with:

**1. New installation script**: `1.architectures/2.aws-parallelcluster/utils/install-node-exporter.sh`
- Binary-based deployment (compatible with `OnNodeConfigured` hooks)
- Supports both amd64 and arm64 architectures
- SHA256 checksum verification with retry logic
- Can be referenced directly via GitHub raw URL (zero setup cost)
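
A minimal sketch of how the core steps listed above (architecture detection, download with retry, checksum verification) could look. It assumes POSIX `sh`, `curl`, and `sha256sum` are available on the node. The function names, the default version, and the install path are illustrative assumptions and not the contents of the proposed `install-node-exporter.sh`.

```shell
#!/bin/sh
# Illustrative sketch only; not the actual install-node-exporter.sh.
set -eu

# Assumed default; pin whichever node_exporter release you have validated.
NODE_EXPORTER_VERSION="${NODE_EXPORTER_VERSION:-1.8.2}"

# Map `uname -m` output to the architecture suffix used in release file names.
detect_arch() {
    case "${1:-$(uname -m)}" in
        x86_64)        echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)             echo "unsupported architecture" >&2; return 1 ;;
    esac
}

# Retry a download a fixed number of times before giving up.
download_with_retry() {
    dl_url=$1; dl_dest=$2; dl_attempts=${3:-3}
    dl_try=1
    while [ "$dl_try" -le "$dl_attempts" ]; do
        if curl -fsSL -o "$dl_dest" "$dl_url"; then
            return 0
        fi
        echo "download attempt $dl_try/$dl_attempts failed: $dl_url" >&2
        dl_try=$((dl_try + 1))
        sleep 5
    done
    return 1
}

# Succeed only if the file's SHA256 digest equals the expected hex string.
verify_sha256() {
    vs_actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$vs_actual" = "$2" ]
}

# Download, verify, and install the binary (hypothetical entry point, not called here).
install_node_exporter() {
    in_arch=$(detect_arch)
    in_name="node_exporter-${NODE_EXPORTER_VERSION}.linux-${in_arch}"
    in_base="https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}"
    in_tmp=$(mktemp -d)
    download_with_retry "${in_base}/${in_name}.tar.gz" "${in_tmp}/${in_name}.tar.gz"
    download_with_retry "${in_base}/sha256sums.txt" "${in_tmp}/sha256sums.txt"
    # Look up the expected digest for this tarball in the published checksum list.
    in_expected=$(awk -v f="${in_name}.tar.gz" '$2 == f {print $1}' "${in_tmp}/sha256sums.txt")
    verify_sha256 "${in_tmp}/${in_name}.tar.gz" "$in_expected"
    tar -xzf "${in_tmp}/${in_name}.tar.gz" -C "$in_tmp"
    install -m 0755 "${in_tmp}/${in_name}/node_exporter" /usr/local/bin/node_exporter
}
```

A real `OnNodeConfigured` script would end by calling `install_node_exporter` and registering a service so the exporter survives reboots; both steps are left out of this sketch.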

**2. Documentation updates**: `README-OS-grafana.md`
- Add platform support statement for both HyperPod and ParallelCluster
- Add platform-specific compute node setup instructions

**3. Configuration fixes**: `prometheus-agent-collector.yaml`
- Fix `scrape_timeout: 5m` → `30s` (complies with Prometheus 3.x validation)
- Remove `instance-type` filter from `efa_node_exporter` job (node_exporter is hardware-agnostic)
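
The timeout constraint is simple arithmetic: `5m` is 300 seconds, which exceeds the `1m` (60 second) interval, while `30s` fits within it. The sketch below shows one way to check this before deploying a config. The helper names `to_seconds` and `timeout_is_valid` are hypothetical, and the parser deliberately handles only single-unit durations such as `30s`, `1m`, or `2h`, not the full Prometheus duration syntax.

```shell
#!/bin/sh
# Illustrative pre-flight check; helper names are hypothetical.
set -eu

# Convert a single-unit duration (e.g. 30s, 5m, 2h) to whole seconds.
# Returns non-zero for anything else (e.g. 500ms or 1m30s).
to_seconds() {
    ts_num=${1%?}              # everything except the last character
    ts_unit=${1#"$ts_num"}     # the last character (the unit)
    case $ts_num in
        ''|*[!0-9]*) return 1 ;;
    esac
    case $ts_unit in
        s) echo "$ts_num" ;;
        m) echo $((ts_num * 60)) ;;
        h) echo $((ts_num * 3600)) ;;
        *) return 1 ;;
    esac
}

# Exit status 0 if scrape_timeout <= scrape_interval, 1 if not,
# 2 if either value could not be parsed.
timeout_is_valid() {
    tv_timeout=$(to_seconds "$1") || return 2
    tv_interval=$(to_seconds "$2") || return 2
    [ "$tv_timeout" -le "$tv_interval" ]
}

# Usage: compare the original and the proposed timeout against a 1m interval.
if timeout_is_valid 5m 1m; then echo "5m: ok"; else echo "5m: exceeds interval"; fi
if timeout_is_valid 30s 1m; then echo "30s: ok"; else echo "30s: exceeds interval"; fi
```

For a complete Prometheus configuration file, `promtool check config` performs a full validation and is the better choice where it applies.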

**Benefits**:
- Enables monitoring for **all** ParallelCluster compute node types (CPU, GPU)
- Unified observability solution for both HyperPod and ParallelCluster
- Complies with Prometheus specification
- Simple deployment via GitHub raw URL

**ParallelCluster configuration example**:
```yaml
Scheduling:
  SlurmQueues:
    - Name: compute
      CustomActions:
        OnNodeConfigured:
          Sequence:
            - Script: https://raw.githubusercontent.com/awslabs/awsome-distributed-training/main/1.architectures/2.aws-parallelcluster/utils/install-node-exporter.sh
      Networking:
        AdditionalSecurityGroups:
          - <PrometheusClusterSecurityGroup>
```

### Alternatives Considered

**Alternative 1: Docker-based deployment for ParallelCluster**
- **Rejected**: ParallelCluster's `OnNodeConfigured` hooks execute during node bootstrap, before Docker is available. Supporting Docker at that stage would require significant cluster configuration changes.

**Alternative 2: Keep instance-type filter**
- **Rejected**: node_exporter is hardware-agnostic and collects OS metrics regardless of instance type. Filtering limits monitoring to GPU instances only, excluding CPU nodes.

**Alternative 3: Keep original scrape_timeout value**
- **Rejected**: Prometheus 3.x validates that `scrape_timeout` ≤ `scrape_interval`. The current `5m` timeout with a `1m` interval violates this constraint, so the configuration may be rejected by Prometheus versions that enforce it.

### Additional Context

**Workshop integration**:
- This feature is being used in the AWS ParallelCluster workshop: https://catalog.workshops.aws/ml-on-aws-parallelcluster

**Prometheus specification reference**:
- scrape_timeout specification: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
- The documentation describes `scrape_timeout` as the per-scrape timeout for a target and requires that it not exceed `scrape_interval`.
